Abstract

This paper describes a novel method to solve average-reward semi-Markov decision
processes by reducing them to a minimal sequence of cumulative-reward problems. The usual
solution methods for this type of problem update the gain (optimal average reward)
immediately after observing the result of taking an action. The alternative introduced here,
optimal nudging, relies instead on setting the gain to some fixed value, which transitorily
makes the problem a cumulative-reward task, solving it by any standard reinforcement
learning method, and only then updating the gain in a way that minimizes uncertainty in a
minimax sense. The rule for optimal gain update is derived by exploiting the geometric
features of the $w-l$ space, a simple mapping of the space of policies. The total number of
cumulative-reward tasks that need to be solved is shown to be small. Some experiments are
presented to explore the features of the algorithm and to compare its performance with
other approaches.

Keywords: Reinforcement Learning, Average Rewards, Semi-Markov Decision Processes. 

1. Introduction 

Consider a simple game, some solitaire variation, for example, or a board game against a 
fixed opponent. Assume that neither draws nor unending matches are possible in this game 
and that the only payouts are +$1 on winning and -$1 on losing. If each new game position
(state) depends stochastically only on the preceding one and the move (action) made, but 
not on the history of positions and moves before that, then the game can be modelled as a 
Markov decision process. Further, since all matches terminate, the process is episodic. 

Learning the game, or solving the decision process, is equivalent to finding the best
playing strategy (policy), that is, determining what moves to make on each position in 
order to maximize the probability of winning / expected payout. This is the type of problem 
commonly solved by cumulative-reward reinforcement learning methods. 

Now, assume that, given the nature of this game, as is often the case, the winning
probability is optimized by some cautious policy, whose gameplay favours avoiding risks and
hence results in relatively long matches. For example, assume that for this game a policy
is known to win with certainty in 100 moves. On the other hand, typically some policies
trade performance (winning probability) for speed (shorter episode lengths). Assume
another known policy, obviously sub-optimal in the sense of expected payout, has a winning
probability of just 0.6, but takes only 10 moves on average to terminate.

If one is going to play a single episode, doubtlessly the first strategy is the best available, 
since following it winning is guaranteed. However, over a sequence of games, the second 
policy may outperform the ‘optimal’ one in a very important sense: if each move costs the 
same (for instance, if all take the same amount of time to complete), whereas the policy that 
always wins receives on average $0.01/move, the other strategy earns twice as much. Thus,
over an hour, or a day, or a lifetime of playing, the ostensibly sub-optimal game strategy
will double the earnings of the apparent optimizer. This is a consequence of the fact that
the second policy has a higher average reward, receiving a larger payout per action taken.
Finding policies that are optimal in this sense is the problem solved by average-reward
reinforcement learning.
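
The arithmetic behind this comparison can be checked with a short sketch. It assumes, as the text does, payouts of +$1 and -$1 and unit cost per move, and it takes the long-run payout per move to be the expected payout of one episode divided by its expected length. The function name `gain_per_move` is illustrative and not part of the paper.

```python
def gain_per_move(win_probability, expected_moves, win_payout=1.0, loss_payout=-1.0):
    """Expected payout per move when the same policy is replayed episode after episode."""
    expected_payout = (win_probability * win_payout
                       + (1.0 - win_probability) * loss_payout)
    return expected_payout / expected_moves


# The two policies described in the text.
cautious_gain = gain_per_move(win_probability=1.0, expected_moves=100)
fast_gain = gain_per_move(win_probability=0.6, expected_moves=10)

print(f"cautious policy: {cautious_gain:.3f} $/move")
print(f"fast policy:     {fast_gain:.3f} $/move")
```

The second policy wins $0.2 per episode in expectation (0.6 - 0.4), spread over 10 moves, which is where the factor of two comes from.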

In a more general case, if each move has a different associated cost, such as the time
it would take to move a token on a board the number of steps dictated by a die, then
the problem would be average-reward semi-Markov, and the goal would change to finding
a policy, possibly different from either of the two discussed above, that maximizes the
expected amount of payout received per unit of action cost.

Average-reward and semi-Markov tasks arise naturally in the areas of repeated episodic 
tasks, as in the example just discussed, queuing theory, autonomous robotics, and quality 
of service in communications, among many others. 

This paper presents a new algorithm to solve average-reward and semi-Markov decision
processes. The traditional solutions to this kind of problem require a large number of
samples, where a sample is usually the observation of the effect of taking an action from
a state: the cost of the action, the reward received and the resulting next state. For each
sample, the algorithms basically update the gain (average reward) of the task and the
gain-adjusted value of taking that action from that state, that is, a measure of how good
an idea taking it is.

Some of the methods in the literature that follow this solution template are R-learning
(Schwartz, 1993), Algorithms 3 and 4 by Singh (1994), SMART (Das et al., 1999), and
the “New Algorithm” by Gosavi (2004). The table in Section 3.2 introduces a more
comprehensive taxonomy of solution methods for average-reward and semi-Markov problems.

The method introduced in this paper, optimal nudging, operates differently. Instead of
rushing to update the gain after each sample, it temporarily fixes the gain to some value,
which results in a cumulative-reward task that is solved (by any method). Then, based on
the solution found, the gain is updated in a way that minimizes the uncertainty range known
to contain its optimum.
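
The fix-solve-update loop can be illustrated with a minimal sketch. It is not the optimal nudging rule derived later in the paper: the gain update used here is plain bisection, chosen only because it is easy to verify. The sketch assumes that every policy is summarized by a hypothetical pair (expected reward, expected cost) measured from a reference state, with positive cost, and that the "black box" is an exhaustive search over those pairs. It relies on the standard fact that the best fixed-gain value is positive when the fixed gain is below the optimum and non-positive otherwise.

```python
def solve_fixed_gain(policies, gain):
    """Stand-in 'black box': best policy for the cumulative task with rewards r - gain * k."""
    return max(policies, key=lambda vc: vc[0] - gain * vc[1])


def nudge_by_bisection(policies, gain_low, gain_high, tolerance=1e-6):
    """Shrink the interval known to contain the optimal gain, one solver call at a time."""
    calls = 0
    while gain_high - gain_low > tolerance:
        gain = 0.5 * (gain_low + gain_high)
        value, cost = solve_fixed_gain(policies, gain)
        calls += 1
        if value - gain * cost > 0.0:
            gain_low = gain    # some policy still earns more than `gain` per unit cost
        else:
            gain_high = gain   # no policy can beat `gain`
    best = solve_fixed_gain(policies, gain_low)
    return best, 0.5 * (gain_low + gain_high), calls


# Hypothetical (value, cost) pairs; the second one has the largest ratio, 0.02.
POLICIES = [(1.0, 100.0), (0.2, 10.0), (0.5, 40.0)]
best_policy, gain_estimate, solver_calls = nudge_by_bisection(POLICIES, -1.0, 1.0)
print(best_policy, round(gain_estimate, 4), solver_calls)
```

Each pass through the loop corresponds to one complete cumulative-reward task, so the number of solver calls, rather than the number of samples, is the natural measure of cost for this kind of scheme.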

The main contribution of this paper is the introduction of a novel algorithm to solve
semi-Markov (and simpler average-reward) decision processes by reducing them to a minimal
sequence of cumulative-reward tasks that can be solved by any of the fast, robust existing
methods for that kind of problem. Hence, we refer to the method used to solve these tasks
as a ‘black box’.

The central step of the optimal nudging algorithm is a rule for updating the gain between
calls to the ‘black-box’ solver, in such a way that after solving the resulting
cumulative-reward task, the worst case for the associated uncertainty around the value of
the optimal gain will be the smallest possible.

The update rule exploits what we have called a “Bertsekas split” of each task as well as
the geometry of the $w-l$ space, a mapping of the policy space into the interior of a small
triangle, in which the convergence of the solutions of the cumulative-reward tasks to the
optimal solution of the average-reward problem can be easily intuited and visualized.

In addition to this, the derivation of the optimal nudging update rule yields an early 
termination condition, related to sign changes in the value of a reference state between 
successive iterations for which the same policy is optimal. This condition is unique to 
optimal nudging, and no equivalent is possible for the preceding algorithms in the literature. 

The complexity of optimal nudging, understood as the number of calls to the “black-box”
routine, is shown to be at worst logarithmic in the (inverse) desired final uncertainty
and in an upper bound on the optimal gain. The number of samples required in each call
is in principle inherited from the “black box”, but also depends strongly on whether, for
example, transfer learning is used and state values are not reset between iterations.

Among other advantages of the proposed algorithm over the other methods discussed, two
key ones are that it requires adjusting fewer parameters and performs fewer updates
per sample.

Finally, the experimental results presented show that the performance of optimal nudging,
even without fine tuning, is similar to or better than that of the best usual algorithms.
The experiments also illuminate some specific features of the algorithm, particularly the
great advantage of having the early termination condition.

The rest of the paper is structured as follows. Section 2 formalizes the problem, defining
the different types of Markov decision processes (cumulative- and average-reward and
semi-Markov) under a unified notation, introducing the important unichain condition and
describing why it is important to assume that it holds.

Section 3 presents a summary of solution methods to the three types of processes, 
emphasizing the distinctions between dynamic programming and model-based and model-free
reinforcement learning algorithms. This section also introduces a new taxonomy of the
average-reward algorithms from the literature that allows us to propose a generic algorithm
that encompasses all of them. Special attention is given in this section to the family of
stochastic shortest path methods, from which the concept of the Bertsekas split is extracted.
Finally, a motivating example task is introduced to compare the performance of some of 
the traditional algorithms and optimal nudging. 

In Section 4, the core derivation of the optimal nudging algorithm is presented, starting
from the idea of nudging and the definition of the $w-l$ space and enclosing triangles. The
early termination condition by zero crossing is presented as a special case of reduction of
enclosing triangles, and the exploration of optimal reduction leads to the main theorem
and final proposition of the algorithm.

Section 5 describes the complexity of the algorithm by showing that it outperforms a 
simpler version of nudging for which the computation of complexity is straightforward. 

Finally, Section 6 presents results for a number of experimental set-ups and in Section 
7 some conclusions and lines for future work are discussed. 

2. Problem Definition


In this section, Markov decision processes are described and three different reward
maximization problems are introduced: expected cumulative reinforcement, average-reward,
and average-reward semi-Markov models. Average-reward problems are a subset of
semi-Markov average-reward problems. This paper introduces a method to solve both kinds of
average-reward problems as a minimal sequence of cumulative-reward, episodic processes.
Two relevant common assumptions in average-reward models, that the unichain condition
holds (Ross, 1970) and that a recurrent state exists (Bertsekas, 1998; Abounadi et al.,
2002), are described and discussed at the end of the section.

In all cases, an agent in an environment observes its current state and can take actions 
that, following some static distribution, lead it to a new state and result in a real-valued 
reinforcement/reward. It is assumed that the Markov property holds, so the next state 
depends only on the current state and action taken, but not on the history of previous 
states and actions. 


2.1 Markov Decision Processes 


An infinite-horizon Markov decision process (MDP; Sutton and Barto, 1998; Puterman,
1994) is defined minimally as a four-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$.
$\mathcal{S}$ is the set of states in the environment. $\mathcal{A}$ is the set of actions,
with $\mathcal{A}_s$ equal to the subset of actions available to take from state $s$ and
$\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$. We assume that both $\mathcal{S}$
and $\mathcal{A}$ are finite. The stationary function
$\mathcal{P} : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1]$ defines the
transition probabilities of the system. After taking action $a$ from state $s$, the
resulting state is $s'$ with probability $\mathcal{P}^{a}_{ss'} = P(s' \mid s, a)$. Likewise,
$\mathcal{R}^{a}_{ss'}$ denotes the real-valued reward observed after taking action $a$ and
transitioning from state $s$ to $s'$. For notational simplicity, we define
$r(s,a) = E[\mathcal{R}^{a}_{ss'} \mid s, a]$. At decision epoch $t$, the agent is in state
$s_t$, takes action $a_t$, transitions to state $s_{t+1}$ and receives reinforcement
$r_{t+1}$, which has expectation $r(s_t, a_t)$.

If the task is episodic, there must be a terminating state, defined as transitioning to 
itself with probability 1 and reward 0. Without loss of generality, multiple terminating 
states can be treated as a single one. 

An element $\pi : \mathcal{S} \to \mathcal{A}$ of the policy space $\Pi$ is a rule or
strategy that dictates for each state $s$ which action to take, $\pi(s)$. We are only
concerned with deterministic policies, in which each state has a single associated action,
taken with probability one. This is not too restrictive, since Puterman (1994) has shown
that, if an optimal policy exists, an optimal deterministic policy exists as well.
Moreover, policies are assumed herein to be stationary. The value of a policy from a given
state, $v^{\pi}(s)$, is the expected cumulative reward observed starting from $s$ and
following $\pi$,


$$ v^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi \right] , \tag{1} $$

where $0 < \gamma \le 1$ is a discount factor with $\gamma = 1$ corresponding to no discount.
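
As an illustration of Equation (1) with $\gamma = 1$, the following sketch evaluates a fixed policy on a small episodic chain by repeated sweeps. The chain, the state labels and the function name `evaluate_policy` are assumptions made for the example; the representation (one list of `(probability, next_state, reward)` outcomes per state, already conditioned on the policy's action) is just one convenient choice.

```python
def evaluate_policy(transitions, terminal, sweeps=1000, tolerance=1e-12):
    """Undiscounted value of each state under a fixed policy on an episodic task."""
    values = {s: 0.0 for s in transitions}
    values[terminal] = 0.0  # the terminating state has value zero by definition
    for _ in range(sweeps):
        largest_change = 0.0
        for s, outcomes in transitions.items():
            new_value = sum(p * (r + values[s2]) for p, s2, r in outcomes)
            largest_change = max(largest_change, abs(new_value - values[s]))
            values[s] = new_value
        if largest_change < tolerance:
            break
    return values


# Hypothetical two-state game: state 0 lingers, state 1 decides the match.
TERMINAL = 2
POLICY_TRANSITIONS = {
    0: [(0.5, 0, 0.0), (0.5, 1, 0.0)],
    1: [(0.6, TERMINAL, 1.0), (0.4, TERMINAL, -1.0)],
}
state_values = evaluate_policy(POLICY_TRANSITIONS, TERMINAL)
print(state_values)
```

The sweeps converge without discount here only because the policy reaches the terminating state with probability one, which is the assumption made in Remark 1.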

The goal is to find a policy that maximizes the expected reward. Thus, an optimal policy
$\pi^{*}$ has maximum value for each state; that is,

$$ \pi^{*}(s) \in \operatorname*{argmax}_{\pi \in \Pi} v^{\pi}(s) \quad \forall s \in \mathcal{S} , $$

so

$$ v^{*}(s) = v^{\pi^{*}}(s) \ge v^{\pi}(s) \quad \forall s \in \mathcal{S},\ \pi \in \Pi . $$


Remark 1 The discount factor ensures convergence of the infinite sum in the policy values
of Equation (1), so it is used to keep values bounded if rewards are bounded, even in
problems and for policies for which episodes have infinite duration. Ostensibly, introducing
it makes rewards received sooner more desirable than those received later, which would make
it useful when the goal is to optimize a measure of immediate (or average) reward. However,
for the purposes of this paper, the relevant policies for the discussed resulting MDPs will
be assumed to terminate eventually with probability one from all states, so the infinite sum
will converge even without discount. Furthermore, the perceived advantages of discount are
less sturdy than initially apparent (Mahadevan, 1994), and discounting is not guaranteed to
lead to gain optimality (Uribe et al., 2011). Thus, no discount will be used in this paper
($\gamma = 1$).


2.2 Average Reward MDPs 

The aim of the average reward model in infinite-horizon MDPs is to maximize the reward
received per step (Puterman, 1994; Mahadevan, 1996). Without discount, all non-zero-valued
policies would have signed infinite value, so the goal must change to obtaining the largest
positive or the smallest negative rewards as frequently as possible. In this case, the gain of
a policy from a state is defined as the average reward received per action taken following
that policy from the state,


$$ \rho^{\pi}_{AR}(s) = \lim_{n \to \infty} \frac{1}{n}\, E\left[ \sum_{t=0}^{n-1} r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi \right] . $$


A gain-optimal policy, $\pi^{*}_{AR}$, has maximum average reward for all states,

$$ \rho^{*}_{AR}(s) \ge \rho^{\pi}_{AR}(s) \quad \forall s \in \mathcal{S},\ \pi \in \Pi . $$

A finer typology of optimal policies in average-reward problems discriminates
bias-optimal policies which, besides being gain-optimal, also maximize the transient reward
received before the observed average approaches $\rho^{*}_{AR}$. For a discussion of the
differences, see the book by Puterman (1994). This paper will focus on the problem of
finding gain-optimal policies.
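
The gain definition can be evaluated numerically for a fixed policy by propagating the state distribution and averaging the expected reward over a long but finite horizon. The sketch below does this for a hypothetical two-state chain; the matrix, the reward vector and the function name `average_reward` are assumptions made for the example, and the finite horizon only approximates the limit.

```python
def average_reward(transition, reward, start, steps):
    """Finite-horizon estimate of the gain: (1/n) * sum over t of E[r(s_t)], from `start`."""
    n_states = len(reward)
    distribution = [1.0 if s == start else 0.0 for s in range(n_states)]
    total = 0.0
    for _ in range(steps):
        # Expected reward at this step, then advance the state distribution one step.
        total += sum(d * r for d, r in zip(distribution, reward))
        distribution = [
            sum(distribution[s] * transition[s][s2] for s in range(n_states))
            for s2 in range(n_states)
        ]
    return total / steps


# Hypothetical policy-induced chain; its stationary distribution is (5/6, 1/6).
TRANSITION = [[0.9, 0.1],
              [0.5, 0.5]]
REWARD = [1.0, 0.0]
gain_from_0 = average_reward(TRANSITION, REWARD, start=0, steps=10000)
gain_from_1 = average_reward(TRANSITION, REWARD, start=1, steps=10000)
print(gain_from_0, gain_from_1)
```

Both starting states give nearly the same estimate, close to 5/6, because this chain has a single recurrent class. The small remaining difference is the transient effect that bias-optimality is concerned with.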


2.3 Discrete Time Semi-MDPs 

In the average-reward model all state transitions weigh equally. Equivalently, all actions
from all states are considered as having the same (unity) duration or cost. In semi-Markov
decision processes (SMDPs; Ross, 1970) the goal remains maximizing the average reward
received per action, but not all actions are required to have the same weight.

2.3.1 Transition Times 

The usual description of SMDPs assumes that, after taking an action, the time to transition
to a new state is not constant (Feinberg, 1994; Das et al., 1999; Baykal-Gürsoy and Gürsoy,
2007; Ghavamzadeh and Mahadevan, 2007). Formally, at decision epoch $t$ the agent is in
state $s_t$ and only after an average of $N_t$ seconds of having taken action $a_t$ it
evolves to state $s_{t+1}$ and observes reward $r_{t+1}$.

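The transition time function, then, is
$\mathcal{N} : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{+}$ (where
$\mathbb{R}^{+}$ is the set of positive real numbers). Also, since reward can possibly lump
between decision epochs, its expectation, $r(s_t, a_t)$, is marginalized over expected
transition times as well as new states. Consequently, the gain of a policy from a state
becomes

$$ \rho^{\pi}_{SM}(s) = \lim_{n \to \infty} \frac{E\left[ \sum_{t=0}^{n-1} r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi \right]}{E\left[ \sum_{t=0}^{n-1} N_t \,\middle|\, s_0 = s, \pi \right]} . $$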


2.3.2 Action Costs 

We propose an alternative interpretation of the SMDP framework, in which all actions can
yield constant-length transitions while consuming varying amounts of some resource (for
example time, but also energy, money or any combination thereof). This results in the agent
observing a real-valued action cost $k_{t+1}$, which is not necessarily related to the
reward $r_{t+1}$, received after taking action $a_t$ from state $s_t$. As above, the
assumption is that cost depends on the initial and final states and the action taken and
has an expectation of the form $k(s,a)$. In general, all costs are supposed to be positive,
but for the purposes of this paper this is relaxed to requiring that all policies have
positive expected cost from all states. Likewise, without loss of generality it will be
assumed that all action costs either are zero or have expected magnitude greater than or
equal to one,

$$ |k(s,a)| \ge 1 \quad \forall\, k(s,a) \ne 0 . $$

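In this model, a policy $\pi$ has expected cost

$$ c^{\pi}(s) = \lim_{n \to \infty} E\left[ \sum_{t=0}^{n-1} k(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi \right] , $$

with

$$ c^{\pi}(s) \ge 1 \quad \forall s \in \mathcal{S},\ \pi \in \Pi . $$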

Observe that both definitions are analytically equivalent. That is, $N_t$ and
$k(s_t, \pi(s_t))$ have the same role in the gain. Although their definitions and
interpretations vary (expected time to transition versus expected action cost), both give
rise to identical problems with gain $\rho^{\pi}_{SM}$.

Naturally, if all action costs or transition times are equal, the semi-Markov model reduces
to average rewards, up to scale, and both problems are identical if the costs/times are
unity-valued. For notational simplicity, from now on we will refer to the gain in both
problems simply as $\rho$.

2.3.3 Optimal policies

A policy $\pi^{*}$ with gain

$$ \rho^{\pi^{*}}(s) = \rho^{*}(s) = \lim_{n \to \infty} \frac{E\left[ \sum_{t=0}^{n-1} r(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi^{*} \right]}{E\left[ \sum_{t=0}^{n-1} k(s_t, \pi(s_t)) \,\middle|\, s_0 = s, \pi^{*} \right]} = \frac{v^{\pi^{*}}(s)}{c^{\pi^{*}}(s)} $$

is gain-optimal if

$$ \rho^{*}(s) \ge \rho^{\pi}(s) \quad \forall s \in \mathcal{S},\ \pi \in \Pi , $$

similarly to the way it was defined for the average-reward problem.

Remark 2 Observe that the gain-optimal policy does not necessarily maximize $v^{\pi}$, nor
does it minimize $c^{\pi}$. It only optimizes their ratio.
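
Remark 2 can be made concrete with three hypothetical policies, each summarized by its value $v^{\pi}$ and cost $c^{\pi}$ from a common state. The names and numbers below are invented for the illustration.

```python
# Hypothetical (value, cost) pairs for three policies from the same state.
POLICY_STATS = {
    "cautious": (1.0, 100.0),   # largest value, but very expensive
    "fast": (0.2, 10.0),        # moderate value at moderate cost
    "reckless": (-0.5, 2.0),    # cheapest, but loses on average
}

max_value_policy = max(POLICY_STATS, key=lambda name: POLICY_STATS[name][0])
min_cost_policy = min(POLICY_STATS, key=lambda name: POLICY_STATS[name][1])
gain_optimal_policy = max(
    POLICY_STATS, key=lambda name: POLICY_STATS[name][0] / POLICY_STATS[name][1]
)

print(max_value_policy, min_cost_policy, gain_optimal_policy)
```

The three criteria select three different policies here, which is why an average-reward solver cannot simply reuse the value-maximizing or the cost-minimizing policy.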

The following two sections discuss two technical assumptions that are commonly used 
in the average-reward and semi-Markov decision process literature to simplify analysis, 
guaranteeing that optimal stationary policies exist. 


2.4 The Unichain Assumption 


The transition probabilities of a fixed deterministic policy $\pi \in \Pi$ define a
stochastic matrix, that is, the transition matrix of a homogeneous Markov chain on
$\mathcal{S}$. In that embedded chain, a state is called transient if, after a visit, there
is a non-zero probability of never returning to it. A state is recurrent if it is not
transient. A recurrent state will be visited in finite time with probability one. A
recurrent class is a set of recurrent states such that no outside states can be reached by
states inside the set (Kemeny and Snell, 1960).


An MDP is called multichain if at least one policy has more than one recurrent class,
and unichain if every policy has only one recurrent class. In a unichain problem, for all
$\pi \in \Pi$, the state space can be partitioned as

$$ \mathcal{S} = R^{\pi} \cup T^{\pi} , \tag{2} $$

where $R^{\pi}$ is the single recurrent class and $T^{\pi}$ is a (possibly empty) transient
set. Observe that these partitions can be unique to each policy; the assumption is to have a
single recurrent class per policy, not a single one for the whole MDP.
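
For a single policy whose transition matrix is known, the recurrent classes can be found by simple reachability. The sketch below uses the fact that, in a finite chain, a state is recurrent exactly when every state reachable from it can reach it back. The function names and both example matrices are assumptions made for the illustration. Checking the unichain condition for a whole MDP would require this to hold for every policy, which is the hard problem discussed next.

```python
def reachable_from(transition, start):
    """States reachable from `start` (including itself) through positive-probability moves."""
    seen = {start}
    stack = [start]
    while stack:
        s = stack.pop()
        for s2, p in enumerate(transition[s]):
            if p > 0.0 and s2 not in seen:
                seen.add(s2)
                stack.append(s2)
    return seen


def recurrent_classes(transition):
    """Recurrent classes of the finite Markov chain induced by one fixed policy."""
    n = len(transition)
    reach = [reachable_from(transition, s) for s in range(n)]
    classes = []
    for s in range(n):
        # s is recurrent iff every state it can reach can also reach s back.
        if all(s in reach[s2] for s2 in reach[s]):
            closed_class = frozenset(reach[s])
            if closed_class not in classes:
                classes.append(closed_class)
    return classes


# One recurrent class {1, 2} plus a transient state 0.
SINGLE_CLASS_CHAIN = [[0.0, 1.0, 0.0],
                      [0.0, 0.5, 0.5],
                      [0.0, 1.0, 0.0]]
# Two absorbing states, hence two recurrent classes.
TWO_CLASS_CHAIN = [[0.5, 0.25, 0.25],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]]
print(recurrent_classes(SINGLE_CLASS_CHAIN))
print(recurrent_classes(TWO_CLASS_CHAIN))
```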

If the MDP is multichain, a single optimality expression may not suffice to describe
the gain of the optimal policy, stationary optimal policies may not exist, and theory and
algorithms are more complex. On the other hand, if it is unichain, clearly for any given
$\pi$ all states will have the same gain, $\rho^{\pi}(s) = \rho^{\pi}$, which simplifies the
analysis and is a sufficient condition for the existence of stationary, gain-optimal
policies (Puterman, 1994).


Consequently, most of the literature on average reward MDPs and SMDPs relies on
the assumption that the underlying model is unichain (see for example Mahadevan, 1996;
Ghavamzadeh and Mahadevan, 2007, and references therein). Nevertheless, the problem of
deciding whether a given MDP is unichain is not trivial. In fact, Kallenberg (2002) posed
the problem of whether a polynomial algorithm exists to determine if an MDP is unichain,
which was answered negatively by Tsitsiklis (2007), who proved that it is NP-hard.


2.5 Recurrent States 

The term recurrent is also used, confusingly, to describe a state of the decision process that
belongs to a recurrent class of every policy. The expression “recurrent state” will only be
used in this sense from now on in this paper. Multi- and unichain processes may or may
not have recurrent states. However, Feinberg and Yang (2008) proved that a recurrent state
can be found or shown not to exist in polynomial time (in $|\mathcal{S}|$ and
$|\mathcal{A}|$), and give methods for doing so. They also proved that, if a recurrent state
exists, whether the unichain condition holds can be decided in polynomial time, and
proposed an algorithm for doing so.

Instead of actually using those methods, which would require a full knowledge of the 
transition probabilities that we do not assume, we emphasize the central role of the recurrent 
states, when they exist or can be induced, in simplifying analysis. 

Remark 3 Provisionally, for the derivation below we will require all problems to be unichain 
and to have a recurrent state. However these two requirements will be further qualified in 
the experimental results. 

3. Overview of Solution Methods and Related Work 

This section summarizes the most relevant solution methods for cumulative-reward MDPs
and average-reward SMDPs, with special emphasis on stochastic shortest path algorithms.
Our proposed solution to average-reward SMDPs will use what we call a Bertsekas split
from these algorithms, to convert the problem into a minimal sequence of MDPs, each of
which can be solved by any of the existing cumulative-reward methods. At the end of the
section, a simple experimental task from Sutton and Barto (1998) is presented to examine
the performance of the discussed methods and to motivate our subsequent derivation.


3.1 Cumulative Rewards 


Cumulative-reward MDPs have been widely studied. The survey of Kaelbling et al. (1996)
and the books by Bertsekas and Tsitsiklis (1996), Sutton and Barto (1998), and Szepesvari
(2010) include comprehensive reviews of approaches and algorithms to solve MDPs (also
called reinforcement learning problems). We will present a brief summary and propose
a taxonomy of methods that suit our approach of accessing a “black box” reinforcement
learning solver.

In general, instead of trying to find policies that maximize state value from Equation
(1) directly, solution methods seek policies that optimize the state-action pair (or simply
‘action’) value function,

$$ Q^{\pi}(s, a) = E\left[ \sum_{t=0}^{\infty} r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a, \pi \right] , \tag{3} $$

which is defined as the expected cumulative reinforcement after taking action $a$ in state
$s$ and following policy $\pi$ thereafter.

The action value of an optimal policy $\pi^{*}$ corresponds to the solution of the following,
equivalent, versions of the Bellman optimality equation (Sutton and Barto, 1998),


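$$ Q^{*}(s, a) = \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \max_{a'} Q^{*}(s', a') \right] \tag{4} $$

$$ \phantom{Q^{*}(s, a)} = E\left[ r_{t+1} + \max_{a'} Q^{*}(s', a') \,\middle|\, s_t = s, a_t = a \right] . \tag{5} $$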


Dynamic programming methods assume complete knowledge of the transitions $\mathcal{P}$ and
rewards $\mathcal{R}$ and seek to solve Equation (4) directly. An iteration of a typical
algorithm finds (policy evaluation) or approximates (value iteration) the value of the
current policy and subsequently sets as current a policy that is greedy with respect to the
values found (generalized policy iteration). Puterman (1994) provides a very comprehensive
summary of dynamic programming methods, including the use of linear programming to solve
this kind of problem.

If the transition probabilities are unknown, in order to maximize action values it is
necessary to sample actions, state transitions, and rewards in the environment. Model-based
methods use these observations to approximate $\mathcal{P}$, and then use that approximation
to find $Q^{*}$ and $\pi^{*}$ using dynamic programming. Methods in this family usually rely
on complexity bounds guaranteeing performance after a number of samples (or sample
complexity; Kakade, 2003) bounded by a polynomial (that is, efficient) on the sizes of the
state and action sets, as well as other parameters (see footnote 1).

The earliest and most studied model-based methods are PAC-MDP algorithms (efficiently
probably approximately correct on Markov decision processes), which minimize with
high probability the number of future steps on which the agent will not receive near-optimal
reinforcements. $E^{3}$ (Kearns and Singh, 1998, 2002), sparse sampling (Kearns et al.,
2002), Rmax (Brafman and Tennenholtz, 2003), MBIE (Strehl and Littman, 2005), and Vmax
(Rao and Whiteson, 2012) are notable examples of this family of algorithms. Kakade’s
(2003) and Strehl’s (2007) dissertations, and the paper by Strehl et al. (2009), provide
extensive theoretical discussions on a broad range of PAC-MDP algorithms.

Another learning framework for which model-based methods exist is KWIK (knows what it
knows; Li et al., 2008; Walsh et al., 2010). In it, at any decision epoch, the agent
must return an approximation of the transition probability corresponding to the observed
state, action and next state. This approximation must be arbitrarily precise with high
probability. Alternatively, the agent can acknowledge its ignorance, produce a “$\bot$”
output, and, from the observed, unknown transition, learn. The goal in this framework is to
find a bound on the number of $\bot$ outputs, and for this bound to be polynomial on some
appropriate parameters, including $|\mathcal{S}|$ and $|\mathcal{A}|$.

1. These other parameters often include the expression $1/(1-\gamma)$, where $\gamma$ is the
discount factor, which is obviously problematic when, as we assume, $\gamma = 1$. However,
polynomial bounds also exist for undiscounted

Model-free methods, on the other hand, use transition and reward observations to
approximate action values, replacing expectations by samples in Equation (5). The two main
algorithms, from which most variations in the literature derive, are SARSA and Q-learning
(Sutton and Barto, 1998). Both belong to the class of temporal difference methods. SARSA
is an on-policy algorithm that improves by approximating the value of the current policy,
using the update rule

$$ Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + Q_t(s_{t+1}, a_{t+1}) \right) , $$

where $a_{t+1}$ is an action selected from $Q_t$ for state $s_{t+1}$. Q-learning is an
off-policy algorithm that, while following samples obtained acting from the current values
of $Q$, approximates the value of the optimal policy, updating with the rule

$$ Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} + \max_{a} Q_t(s_{t+1}, a) \right) . \tag{6} $$

In both cases, $\alpha_t$ is a learning rate.
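
The two undiscounted update rules differ only in how the next state is valued, which the following sketch makes explicit. The tabular representation (a `defaultdict` keyed by state-action pairs), the function names and the toy numbers are assumptions made for the illustration.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha):
    """On-policy update: bootstrap from the action actually selected in s_next."""
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + q[(s_next, a_next)])


def q_learning_update(q, s, a, r, s_next, actions, alpha):
    """Off-policy update (Equation 6): bootstrap from the greedy action in s_next."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + best_next)


ACTIONS = ["left", "right"]

q_sarsa = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 3.0})
sarsa_update(q_sarsa, "s0", "left", r=1.0, s_next="s1", a_next="left", alpha=0.5)

q_greedy = defaultdict(float, {("s1", "left"): 1.0, ("s1", "right"): 3.0})
q_learning_update(q_greedy, "s0", "left", r=1.0, s_next="s1", actions=ACTIONS, alpha=0.5)

print(q_sarsa[("s0", "left")], q_greedy[("s0", "left")])
```

With the same sample, SARSA moves the estimate toward the value of the action the agent took next, while Q-learning moves it toward the best available next value.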

Q-learning has been widely studied and used for practical applications since its proposal
by Watkins (1989). In general, for an appropriately decaying learning rate such that

$$ \sum_{t=0}^{\infty} \alpha_t = \infty \tag{7} $$

and

$$ \sum_{t=0}^{\infty} \alpha_t^{2} < \infty , \tag{8} $$

and under the assumption that all states are visited and all actions taken infinitely often,
it is proven to converge asymptotically to the optimal value with probability one (Watkins
and Dayan, 1992). Furthermore, in discounted settings, PAC convergence bounds exist for
the case in which every state-action pair $(s, a)$ keeps an independent learning rate of the
form $1/(1 + |\text{visits to } (s,a)|)$ (Szepesvari, 1998), and for Q-updates in the case
when a parallel sampler $PS(\mathcal{M})$ (Kearns and Singh, 1999), which on every call
returns transition/reward observations for every state-action pair, is available (Even-Dar
and Mansour, 2004; Azar et al., 2011).


An additional PAC-MDP, model-free version of Q-learning of interest is delayed
Q-learning (Strehl et al., 2006). Although its main derivation is for discounted settings,
as is usual for this kind of algorithm, building on the work of Kakade (2003) a variation
is briefly discussed in which there is no discount but rather a hard horizon assumption, in
which only the next $H$ action-choices of the agent contribute to the value function. In this
case, the bound is that, with probability $(1 - \delta)$, the agent will follow an
$\epsilon$-optimal policy, that is, a policy derived from an approximation of $Q$ that does
not differ from the optimal values by more than $\epsilon$, on all but

$$ O\left( \ldots \, L(\cdot) \right) $$

steps, where $L(\cdot)$ is a logarithmic function on the appropriate parameters
($|\mathcal{S}|$, $|\mathcal{A}|$, $H$, ...). Usually, this latter term is dropped and the
bound is instead expressed as $\tilde{O}(\ldots)$.

Our method of solving SMDPs will assume access to a learning method for finite-horizon,
undiscounted MDPs. The requirements of the solution it should provide are discussed in 
the analysis of the algorithm. 


3.2 Average Rewards 


As mentioned above, average-reward MDP problems are the subset of average-reward
SMDPs for which all transitions take one time unit, or all actions have equal, unity cost.
Thus, it would be sufficient to consider the larger set. However, average-reward problems
have been the subject of more research, and the resulting algorithms are easily extended to
the semi-Markov framework by multiplying gain by cost in the relevant optimality
equations, so both will be presented jointly here.

In this section we introduce a novel taxonomy of the differing parameters in the update
rules of the main published solution methods for average-reward (including semi-Markov)
tasks. This allows us to propose and discuss a generic algorithm that covers all existing
solutions, and yields a compact summary of them, presented in the table below.


Policies are evaluated in this context using the average-adjusted sum of rewards (Puterman,
1994; Abounadi et al., 2002; Ghavamzadeh and Mahadevan, 2007) value function:

$$ H^{\pi}(s) = \lim_{n \to \infty} E\left[ \sum_{t=0}^{n-1} \left( r(s_t, \pi(s_t)) - k(s_t, \pi(s_t))\, \rho^{\pi} \right) \,\middle|\, s_0 = s, \pi \right] , $$

which measures “how good” the state $s$ is, under $\pi$, with respect to the average
$\rho^{\pi}$. The corresponding Bellman equation, whose solutions include the gain-optimal
policies, is

$$ H^{*}(s) = r(s, \pi^{*}(s)) - k(s, \pi^{*}(s))\, \rho^{*} + E_{\pi^{*}}\left[ H^{*}(s') \right] , \tag{9} $$

where the expectation on the right hand side is over following the optimal policy for any $s'$.


Puterman (1994) and Mahadevan (1996) present comprehensive discussions of dynamic
programming methods to solve average-reward problems. The solution principle is similar
to the one used in cumulative reward tasks: value evaluation followed by policy iteration.
However, the average reward of the policy being evaluated must be either computed or
approximated from successive iterates. The parametric variation of average-reward value
iteration due to Bertsekas (1998) is central to our method and will be discussed in depth
below. For SMDPs, Das et al. (1999) discuss specific synchronous and asynchronous versions
of the relative value iteration algorithm due to White (1963).


Among the model-based methods listed above for cumulative reward problems, $E^{3}$ and
Rmax originally have definitions on average reward models, including in their PAC-MDP
bounds polynomial terms on a parameter called the optimal $\epsilon$-mixing time, defined as
the smallest time after which the observed average reward of the optimal policy actually
becomes $\epsilon$-close to $\rho^{*}$.

In a related framework, also with probability $(1 - \delta)$ as in PAC-MDP, the UCRL2
algorithm of Jaksch et al. (2010) attempts to minimize the total regret (difference with the
accumulated rewards of a gain-optimal policy) over a $T$-step horizon. The regret of this
algorithm is bounded by $O(\Delta |\mathcal{S}| \sqrt{|\mathcal{A}| T})$, where the diameter parameter $\Delta$ of the MDP is
defined as the time it takes to move from any state to any other state using an appropriate
policy. Observe that for $\Delta$ to be finite, any state must be reachable from any other, so
the problem must be communicating, which is a more rigid assumption than the unichain
condition (Puterman, 1994). Similarly, the REGAL algorithm of Bartlett and Tewari (2009)
has a regret bound $O(H |\mathcal{S}| \sqrt{|\mathcal{A}| T})$, where $H$ is a bound on the span of the optimal bias
vector. In this case, the underlying process is required to be weakly communicating, that is,
for the subsets of recurrent and transient states to be the same for all $\pi \in \Pi$. This is also
a more rigid assumption than the unichain condition.

Regarding PAC-MDP methods, no “model free” algorithms similar to delayed-Q are
known at present for average reward problems. Mahadevan (1996) discusses, without
complexity analysis, a model-based approach due to Jalali and Ferguson (1989), and further
refined by Tadepalli and Ok (1998) into the H-learning algorithm, in which relative value
iteration is applied to transition probability matrices and gain is approximated from observed
samples.

Model-free methods for average reward problems, with access to observations of state
transitions and associated rewards, are based on the (gain-adjusted) Q-value update

$$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} - \rho_t k_{t+1} + \max_{a} Q_t(s_{t+1}, a) \right), \qquad (10)$$

where $\alpha_t$ is a learning rate and $\rho_t$ is the current estimation of the average reward.
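As a minimal sketch of Equation (10), the update can be written as one function over a tabular
$Q$. The function name `gain_adjusted_q_update`, the list-of-lists layout of `Q` and the numbers
in the usage lines are illustrative assumptions, not taken from any of the cited algorithms.

```python
def gain_adjusted_q_update(Q, s, a, r, k, s_next, rho, alpha):
    """Apply the gain-adjusted update of Equation (10) to Q[s][a], in place.

    Q is a list of per-state lists of action values (an assumed layout),
    r and k are the observed reward and cost, rho is the current gain estimate.
    """
    target = r - rho * k + max(Q[s_next])
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Hypothetical three-state, two-action table, initialized to zero.
Q = [[0.0, 0.0] for _ in range(3)]
new_value = gain_adjusted_q_update(
    Q, s=0, a=1, r=2.0, k=1.5, s_next=2, rho=0.5, alpha=0.1
)
```

Setting `k=1` for every transition recovers the MDP version of the update, in which the gain
estimate is subtracted directly from the reward.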


Algorithm 1 Generic SMDP solver
Initialize ($\pi$, $\rho$, and $H$ or $Q$)
repeat forever
    Act
    Learn approximation to value of current $\pi$
    Update $\pi$ from learned values
    Update $\rho$


A close analysis of the literature reveals that H-learning and related model-based
algorithms, as well as methods based on the update in Equation (10), can be described using
the generic Algorithm 1. The “Act” step corresponds to the observation of (usually) one
$(s, a, s', r, k)$ tuple following the current version of $\pi$. A degree of exploration is commonly
introduced at this stage; instead of taking a best-known action from $\arg\max_{a \in \mathcal{A}_s} Q(s, a)$,
a suboptimal action is chosen. For instance, in the traditional $\varepsilon$-greedy action selection
method, with probability $\varepsilon$ a random action is chosen uniformly. Sutton and Barto (1998)
discuss a number of exploration strategies. The exploration/exploitation trade-off, that is,
when and how to explore and learn and when to use the knowledge for reward maximization,
is a very active research field in reinforcement learning. All of the PAC-MDP and related
algorithms listed above are based on an “optimism in the face of uncertainty” scheme (Lai
and Robbins, 1985), initializing $H$ or $Q$ to an upper bound on value for all states, to address
more or less explicitly the problem of optimal exploration.

In H-learning, the learning stage of Algorithm 1 includes updating the approximate
transition probabilities for the state and action just observed and then estimating the state
value using a version of Equation (9) with the updated probabilities. In model-free methods,
summarized in Table 1, the learning stage is usually the 1-step update of Equation (10).

The $\rho$ update is also commonly done after one step, but there are a number of different
update rules. Algorithms in the literature vary in two dimensions. The first is when to
update. Some compute an updated approximation of $\rho$ after every action while others do
it only if a best-known action was taken, $a \in \arg\max_{a \in \mathcal{A}_s} Q(s, a)$. The second dimension
is the way the updates are done. A natural approach is to compute $\rho_t$ as the ratio of the
sample rewards and the sample costs,

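$$\rho_{\tau+1} = \frac{\sum_{i=1}^{\tau+1} r_i}{\sum_{i=1}^{\tau+1} k_i}, \qquad (11)$$

where the index $i$ may run over all decision epochs or only over those on which greedy actions
were taken, depending on when the algorithm updates. We refer to this as the ratio update.
Alternatively, the corrected update is of the form

$$\rho_{\tau+1} = (1 - \beta_\tau)\, \rho_\tau + \frac{\beta_\tau}{k_{\tau+1}} \left( r_{\tau+1} + \max_{a} Q_\tau(s_{\tau+1}, a) - \max_{a} Q_\tau(s_\tau, a) \right),$$

whereas in the term-wise corrected update, separately

$$v_{\tau+1} = (1 - \beta_\tau)\, v_\tau + \beta_\tau\, r_{\tau+1},$$
$$c_{\tau+1} = (1 - \beta_\tau)\, c_\tau + \beta_\tau\, k_{\tau+1},$$

and

$$\rho_{\tau+1} = \frac{v_{\tau+1}}{c_{\tau+1}}.$$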


In the last two cases, $\beta_\tau$ is a learning rate. In addition to when and how to perform the
$\rho$ updates, algorithms in the literature also vary in the model used for the learning rates, $\alpha_t$
and $\beta_\tau$. The simplest models take both parameters to be constant, equal to $\alpha$ and $\beta$ for all
$t$ (or $\tau$). As is the case for Q-learning, convergence is proved for sequences of $\alpha_t$ (and now
$\beta_\tau$) that satisfy the two conditions on learning-rate sequences stated above. We call these
decaying learning rates. A simple decaying learning rate is of the form $\alpha_t = 1/(1 + t)$. It can
be easily shown that this rate gives rise to the ratio $\rho$ updates of Equation (11). Some methods
require keeping an individual (decaying) learning rate for each state-action pair. A type of
update for which the same two conditions, and the associated convergence guarantees, hold,
and which may have practical advantages, is the “search-then-converge” procedure of Darken
et al. (1992), called DCM after its authors. A DCM $\alpha_t$ update would be, for example,


$$\alpha_t = \frac{\alpha_0}{1 + \dfrac{t^2}{\alpha_\tau + t}},$$

where $\alpha_0$ and $\alpha_\tau$ are constants.
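The gain-update rules and the DCM rate can be sketched in a few lines. This is a sketch under
assumptions: the function names and the sample rewards and costs are hypothetical, and the
loop only illustrates one way in which a harmonic rate $1/(1+t)$ reproduces the ratio update,
namely through the term-wise corrected rule.

```python
def ratio_update(rewards, costs):
    """Ratio update: accumulated sample rewards over accumulated sample costs."""
    return sum(rewards) / sum(costs)


def corrected_update(rho, r, k, max_q_next, max_q_current, beta):
    """Corrected update: move rho toward a cost-scaled, value-corrected reward."""
    return (1.0 - beta) * rho + (beta / k) * (r + max_q_next - max_q_current)


def termwise_corrected_update(v, c, r, k, beta):
    """Term-wise corrected update: smooth reward and cost separately, then divide."""
    v_new = (1.0 - beta) * v + beta * r
    c_new = (1.0 - beta) * c + beta * k
    return v_new, c_new, v_new / c_new


def dcm_rate(t, alpha_0, alpha_tau):
    """Search-then-converge (DCM) learning rate at step t."""
    return alpha_0 / (1.0 + t * t / (alpha_tau + t))


# Hypothetical observed rewards and costs of four transitions.
rewards = [4.0, 0.0, 8.0, 1.0]
costs = [1.0, 2.0, 1.5, 1.0]

# With beta = 1 / (1 + t), v and c become running means of rewards and costs,
# so their quotient coincides with the ratio of the two sums.
v = c = 0.0
for t, (r, k) in enumerate(zip(rewards, costs)):
    v, c, rho_termwise = termwise_corrected_update(v, c, r, k, beta=1.0 / (1 + t))
rho_ratio = ratio_update(rewards, costs)
```

In this sketch `dcm_rate` stays close to `alpha_0` while `t` is small relative to `alpha_tau`
and decays roughly like `alpha_0 / t` afterwards, which is the "search, then converge" behaviour.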

Table 1 describes the $\rho$ updates and learning rates of the model-free average reward
algorithms found in the literature, together, when applicable, with those for two
model-based (AAC and H-learning) and two hierarchical algorithms (MAXQ and HAR).
Method                                    $\alpha_t$   Only update      $\rho$ update         $\beta_\tau$
                                                       if greedy $a$?

R-learning, Schwartz (1993)               Constant     Yes              Corrected             Constant
Algorithm 3, Singh (1994)                 Constant     No               Corrected             Constant
Algorithm 4, Singh (1994)                 Constant     Yes              Ratio                 -
SMART, Das et al. (1999)                  DCM          No               Ratio                 -
“New algorithm”, Gosavi (2004)            Individual   No               Ratio                 Decaying
Robbins-Monro version, Gosavi (2004)      Individual   No               Term-wise corrected   Decaying

AAC, Jalali and Ferguson (1989)                                         Ratio                 -
H-learning, Tadepalli and Ok (1998)                    Yes              Corrected             Decaying

MAXQ, Ghavamzadeh and Mahadevan (2001)                                  Ratio                 -
HAR, Ghavamzadeh and Mahadevan (2007)                  Yes              Ratio                 -


Table 1: Summary of learning rates and $\rho$ updates of model-free, model-based and
hierarchical average-reward algorithms.


3.3 Stochastic Shortest Path H and Q-Learning 


We focus our interest on an additional model-free average-reward algorithm due to Abounadi
et al. (2002), suggested by a dynamic programming method by Bertsekas (1998), which
connects an average-reward problem with a parametrized family of (cumulative reward)
stochastic shortest path problems.

The fundamental observation is that, if the problem is unichain, the average reward of a
stationary policy must equal the ratio of the expected total reward and the expected total
cost between two visits to a reference (recurrent) state. Thus, the idea is to “separate”
those two visits into the start and the end of an episodic task. This is achieved by splitting a
recurrent state into an initial and a terminal state. Assuming the task is unichain and $s_I$
is a recurrent state, we refer to a Bertsekas split as the resulting problem with


• State space $\mathcal{S} \cup \{s_T\}$, where $s_T$ is an artificial terminal state that, as defined above,
once reached transitions to itself with probability one, reward zero, and, for numerical
stability, cost zero.

• Action space $\mathcal{A}$.

• Transition probabilities

$$\begin{cases} P(s' \mid s, a) & \forall s' \neq s_I, s_T, \\ 0 & s' = s_I, \\ P(s_I \mid s, a) & s' = s_T. \end{cases}$$
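The split of the transition probabilities can be sketched on a small tabular model. The nested
list layout `P[s][a][s_next]`, the choice of appending the terminal state as the last index and
of giving it a single self-loop action, and the two-state example are all assumptions made for
illustration.

```python
def bertsekas_split(P, s_I):
    """Return the transition probabilities of the split problem.

    P[s][a][s_next] are the original probabilities over n states. The result
    has n + 1 states: index n is the artificial terminal state s_T, which
    receives all the probability mass that originally returned to s_I.
    """
    n = len(P)
    s_T = n
    split = []
    for s in range(n):
        actions = []
        for probs in P[s]:
            row = list(probs) + [0.0]
            row[s_T] = row[s_I]  # returning to s_I now terminates the episode
            row[s_I] = 0.0       # s_I is never re-entered in the split problem
            actions.append(row)
        split.append(actions)
    # The terminal state transitions to itself with probability one.
    split.append([[0.0] * n + [1.0]])
    return split


# Hypothetical two-state, two-action model; state 0 plays the role of s_I.
P = [
    [[0.5, 0.5], [0.1, 0.9]],
    [[1.0, 0.0], [0.3, 0.7]],
]
P_split = bertsekas_split(P, s_I=0)
```

Rewards and costs of the transitions redirected to the terminal state would be inherited from
the original transitions into $s_I$; only the self-loop at the terminal state carries zero
reward and zero cost.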


In the restricted setting of average-reward MDPs, Bertsekas (1998) proved the convergence
of the dynamic programming algorithm with coupled iterations

$$H_{t+1}(s) \leftarrow \max_{a \in \mathcal{A}_s} \left[ \sum_{s'} P(s' \mid s, a) \left( r(s, a, s') + H_t(s') \right) - \rho_t \right] \quad \forall s \neq s_T,$$
$$\rho_{t+1} \leftarrow \rho_t + \beta_t H_t(s_I),$$

where $H(s_T)$ is set to zero for all epochs (that is, $s_T$ is terminal) and $\beta_t$ is a decaying
learning rate. The derivation of this algorithm includes a proof that, when $\rho_t$ equals the
optimal gain $\rho^*$, the corrected value of the initial state is zero. This is to be expected, since
$\rho_t$ is subtracted from all rewards and, if it equals the expected average reward, the expectation
for $s_I$ vanishes. Observe that, when this is the case, $\rho_t$ stops changing between iterations.
We provide below an alternative derivation of this fact, from the perspective of fractional
programming.

Abounadi et al. (2002) extended the ideas behind this algorithm to model-free methods
with the stochastic shortest path Q-learning algorithm (synchronous), SSPQ, with $Q$ and
$\rho$ updates, after taking an action,

$$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha_t)\, Q_t(s_t, a_t) + \alpha_t \left( r_{t+1} - \rho_t + \max_{a} Q_t(s_{t+1}, a) \right),$$
$$\rho_{t+1} \leftarrow \Gamma\left( \rho_t + \beta_t \max_{a} Q_t(s_I, a) \right),$$

where $\Gamma$ is the projection to an interval $[-K, K]$ known to contain $\rho^*$.
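A single SSPQ step can be sketched as follows. The dictionary layout of `Q`, the `"terminal"`
key, and all numeric values are hypothetical; the sketch only mirrors the two updates above and
makes no claim about the synchronous scheduling analysed by Abounadi et al. (2002).

```python
def project(x, K):
    """Projection onto the interval [-K, K]."""
    return max(-K, min(K, x))


def sspq_step(Q, rho, s, a, r, s_next, s_I, alpha, beta, K):
    """Apply one Q update in place and return the next gain estimate.

    Q maps each state to a list of action values; a terminal state maps to [0.0].
    Both updates read the values held before this step (Q_t and rho_t).
    """
    rho_next = project(rho + beta * max(Q[s_I]), K)
    target = r - rho + max(Q[s_next])
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return rho_next


# Hypothetical table: state 0 is the initial state of the split problem.
Q = {0: [0.0, 1.0], 1: [0.5], "terminal": [0.0]}
rho_next = sspq_step(
    Q, rho=0.2, s=0, a=0, r=1.0, s_next=1, s_I=0, alpha=0.5, beta=0.1, K=10.0
)
```

The gain estimate stops moving exactly when the best action value at the initial state is zero,
which is the fixed point discussed for the dynamic programming version.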

Remark 4 Both SSP methods just described belong to the generic family described by
Algorithm 1. Moreover, the action value update of SSPQ is identical to the MDP version of
the average-corrected Q-update, Equation (10).


The convergence proof of SSPQ makes the relationship between the two learning rates
explicit, requiring that

$$\beta_t = o(\alpha_t),$$

making the gain update considerably slower than the value update. This is necessary so the
Q-update can provide sufficient approximation of the value of the current policy for there to
be any improvement. If, in the short term, the Q-update sees $\rho$ as (nearly) constant, then
the update actually resembles that of a cumulative reward problem, with rewards $r - \rho$.

The method presented below uses a Bertsekas split of the SMDP, and examines the extreme
case in which the $\rho$ updates occur only when the value of the best policy for the current
gain can be regarded as known.

3.4 A Motivating Example


We will use a simple average-reward example from the discussion of Schwartz’s R-learning
by Sutton and Barto (1998, see section 6.7) to study the behaviour of some of the algorithms
just described and compare them with the method proposed in this paper.

In the access control queuing task, at the head of a single queue that manages access to
$n = 10$ servers, customers of priorities $\{8, 4, 2, 1\}$ arrive with probabilities $\{0.4, 0.2, 0.2, 0.2\}$,
respectively. At each decision epoch, the customer at the head of the queue is either
assigned to a free server (if any are available), with a pay-off equal to the customer’s
priority; or rejected, with zero pay-off. Between decision epochs, servers free independently
with probability $p = 0.06$. Naturally, the goal is to maximize the expected average reward.
The states correspond to the combination of the priority of the customer at the head of the
queue and the number of free servers, and the actions are simply “accept” and “reject”.
For simplicity, there is a single state corresponding to no free servers for any priority, with
only the “reject” action available and reward zero.
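One transition of the task described above can be simulated with the standard library alone.
The function name `step`, the state encoding as a pair (free servers, priority), and the
assumption that a server assigned in the current epoch may already free before the next one are
choices of this sketch rather than details fixed by the description.

```python
import random

PRIORITIES = (8, 4, 2, 1)
ARRIVAL_PROBS = (0.4, 0.2, 0.2, 0.2)
N_SERVERS = 10
P_FREE = 0.06


def step(free_servers, priority, accept, rng):
    """Simulate one decision epoch; return (reward, next_free_servers, next_priority)."""
    reward = 0
    if accept and free_servers > 0:
        reward = priority
        free_servers -= 1
    # Each busy server frees independently with probability P_FREE.
    busy = N_SERVERS - free_servers
    freed = sum(1 for _ in range(busy) if rng.random() < P_FREE)
    # The next customer at the head of the queue is drawn independently.
    next_priority = rng.choices(PRIORITIES, weights=ARRIVAL_PROBS)[0]
    return reward, free_servers + freed, next_priority


rng = random.Random(0)
reward, free_servers, priority = step(N_SERVERS, 8, True, rng)
```

Passing `accept=True` with no free servers yields zero reward in this sketch, which matches the
single all-occupied state in which only rejection is possible.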

To ensure that our assumptions hold for this task, we use the following straightforward 
observation: 


Proposition 5 In the access control queuing task, all optimal policies must accept
customers of priority 8 whenever servers are available.

Thus, if we make “accept” the only action available for states with priority 8 and any 
number of free servers, the resulting task will have the same optimal policies as the original 
problem, and the state with all servers occupied will become recurrent. Indeed, for any state 
with m free servers, and any policy, there is a nonzero probability that all of the next m 
customers will have priority 8 and no servers will free in the current and the next m decision 
epochs. Since the only available action for customers with priority 8 is to accept them, all 
servers would fill, then, so for any state and policy there is a nonzero probability of reaching 
the state with no free servers, making it recurrent. Moreover, since this recurrent state can 
be reached from any state, there must be a single recurrent class per policy, containing the 
all-occupied state and all other states that can be reached from it under the policy, so the 
unichain condition also holds for the resulting task. 

The unique optimal policy for this task, its average-adjusted value and average reward
($\rho^* \approx 3.28$) can be easily found using dynamic programming, and will be used here to
measure algorithm performance. Sutton and Barto (1998) show the result of applying
R-learning to this task using $\varepsilon$-greedy action selection with $\varepsilon = 0.1$ (which we will keep for
all experiments), and parameters $\alpha_t = \beta_t = 0.01$ (see footnote 2). We call this set-up
“R-Learning 1.” However, instead of a single run of samples, which results in states with many
free servers being grossly undersampled, in order to study the convergence of action values
over all states we reset the process to a uniformly selected random state every 10 steps in all
experiments.

The other algorithms tested are: “R-learning 2,” with a smaller rate for the $\rho$ updates
($\alpha_t = 0.01$, $\beta_t = 0.000001$), “SMART” (Das et al., 1999), with parameters $\alpha_0 = 1$ and
$\alpha_\tau = 10^6$ for the DCM update of $\alpha_t$, the “New Algorithm” of Gosavi (2004) with individual
decaying learning rates equal to the inverse of the number of past occurrences of each
state-action pair, and “SSPQ”-learning with rates $\alpha_t$ and $\beta_t$. In all cases, $Q(s, a)$ and
$\rho$ are initialized to 0.


2. The on-line errata of the book corrects $\alpha$ to 0.01, instead of the stated 0.1.


Figure 1: Performance of average-reward reinforcement learning algorithms in the queuing
task. Top left: gain in “R-Learning 1”. Top right: gain in the other set-ups.
Bottom left: convergence to the value of the optimal policy for all states. Bottom
right: number of states for which the current policy differs from the optimal.


Figure 1 shows the results of one run (to avoid flattening the curves due to averaging)
of the different methods for five million steps. All algorithms reach a neighbourhood of $\rho^*$
relatively quickly, but this doesn’t guarantee an equally fast approximation of the optimal
policy or its value function. The value of $\beta_t = 0.01$ used in “R-learning 1” and by Sutton
and Barto causes a fast approximation followed by oscillation (shown in a separate plot for
clarity). The approximations for the smaller $\beta$ and the other approaches are more stable.


Overall, both R-learning set-ups, corresponding to solid black and grey lines in the
plots in Figure 1, achieve a policy closer to optimal and a better approximation of its value,
considerably faster than the other algorithms. On the other hand, almost immediately
SMART, Gosavi’s “New Algorithm”, and SSPQ reach policies that differ from the optimal
in about 5-10 states, but remain in that error range, whereas after a longer transient the
R-learning variants find policies with fewer than 5 non-optimal actions and much better value
approximations.

Remarkably, “R-learning 2” is the only set-up for which the value approximation and
differences with the optimal policy increase at the start, before actually converging faster
and closer to the optimal than the other methods. Only for that algorithm set-up is the
unique optimal policy visited sometimes. This suggests that, for the slowest updating $\rho$,
a different part of the policy space is being visited during the early stages of training.
Moreover, this appears to have a beneficial effect, for this particular task, on both the
speed and quality of final convergence.

Optimal nudging, the method introduced below, goes even further in this direction,
freezing the value of $\rho$ for whole runs of some reinforcement learning method, and only
updating the gain when the value of the current best-known policy is closely approximated.



Figure 2: Convergence to the value of the optimal policy for all states under optimal
nudging, compared with the best R-learning experimental set-up.


Figure 2 shows the result of applying a vanilla version of optimal nudging to the access
control queuing task, compared with the best results above, corresponding to “R-learning
2”. After a step 0 (required to compute a parameter), which takes 500,000 samples, the
optimal nudging steps proper are a series of blocks of 750,000 transition observations for a
fixed $\rho$. The taking of each action is followed by a Q-learning update with the same $\alpha$ and
$\varepsilon$ values as the R-learning experiments. The edges of plateaus in the black curve to the left
of the plot signal the points at which the gain was updated to a new value following the
optimal nudging update rule.

The figure shows that both algorithms have similar performance, reaching comparably
good approximations to the value of the optimal policy in roughly the same number of
iterations. Moreover, optimal nudging finds a similarly good approximation to the optimal
policy, differing from it at the end of the run in only one state. However, optimal nudging
has a number of further advantages: it has one parameter less to adjust, the $\beta$ learning rate,
and consequently it performs one update less per iteration/action taking. This can have a
dramatic effect in cases, such as this one, when obtaining sample transitions is quick and
the updates cannot be done in parallel with the transitions between states. Additionally,
there is no fine tuning in our implementation of the underlying Q-learning method; whereas
in “R-learning” $\alpha$ was adjusted by Sutton and Barto to yield the best possible results, and
we did the same when setting $\beta$ in “R-learning 2”, the implementation of optimal nudging
simply inherits the relevant parameters, without any adjustments or guarantees of best
performance. Even the number of samples between changes of $\rho$, set here to 750,000, is a
parameter that can be improved. The plateaus to the left of the plot suggest that in these
early stages of learning good value approximations could be found much faster than that,
so possibly an adaptive rule for the termination of each call to the reinforcement learning
algorithm might lead to accelerated learning or free later iterations for finer approximation
to optimal value.

4. Optimal nudging. 

In this section we present our main algorithm, optimal nudging. While belonging to the
realm of generic Algorithm 1, the philosophy of optimal nudging is to disentangle the gain
updates from value learning, turning an average-reward or semi-Markov task into a sequence
of cumulative-reward tasks. This has the dual advantage of letting us treat as a black box
the reinforcement learning method used (inheriting its speed, convergence and complexity
features), while allowing for a very efficient update scheme for the gain.

Our method favours a more intuitive understanding of the $\rho$ term, not so much as
an approximation to the optimal average reward (although it remains so), but rather as
a punishment for taking actions, which must be compensated by the rewards obtained
afterwards. It exploits the Bertsekas split to focus on the value and cost of successive visits
to a reference state, and their ratio. The $w - l$ space is introduced as an arena in which it
is possible to update the gain and ensure convergence to the solution. Finally we show that
updates can be performed optimally in a way that requires only a small number of calls to
the black-box reinforcement learning method.

Summary of Assumptions. The algorithm derived below requires the average-reward
semi-Markov decision process to be solved to have finite state and action sets, to contain
at least one recurrent state $s_I$ and to be unichain. Further, it is assumed that the expected
cost of every policy is positive and (without loss of generality) larger than one, and that
the magnitude of all non-zero action costs is also larger than one.

To avoid a duplication of the cases considered in the mathematical derivation that would 
add no further insight, it will also be assumed that at least one policy has a positive average 
reward. If this is the case, naturally, any optimal policy will have positive gain. 

4.1 Fractional Programming. 

Under our assumptions, a Bertsekas split is possible on the states, actions, and transition
probabilities of the task. We will refer to the value of a policy as the expected reward from
the induced initial state, $s_I$, following the policy,

$$v^{\pi} = v^{\pi}(s_I) = E\left[ \sum_{t=0}^{\infty} r(s_t, \pi(s_t)) \,\middle|\, s_0 = s_I, \pi \right],$$

to its cost as the expected cumulative action costs,

$$c^{\pi} = c^{\pi}(s_I) = E\left[ \sum_{t=0}^{\infty} k(s_t, \pi(s_t)) \,\middle|\, s_0 = s_I, \pi \right], \qquad (12)$$

and, as customary, to its gain as the value/cost ratio,

$$\rho^{\pi} = \frac{v^{\pi}}{c^{\pi}}.$$


The following is a restatement of the known fractional programming result (Charnes and
Cooper, 1962), linking in our notation the optimization of the gain ratio to a parametric
family of linear problems on policy value and cost.


Lemma 6 (Fractional programming) The following average-reward and
linear-combination-of-rewards problems share the same optimal policies,

$$\arg\max_{\pi \in \Pi} \frac{v^{\pi}}{c^{\pi}} = \arg\max_{\pi \in \Pi} v^{\pi} + \rho^{*} (-c^{\pi}),$$

for an appropriate value of $\rho^*$ such that

$$\max_{\pi \in \Pi} v^{\pi} + \rho^{*} (-c^{\pi}) = 0.$$

Proof The Lemma is proved by contradiction. Under the stated assumptions, for all
policies, both $v^{\pi}$ and $c^{\pi}$ are finite, and $c^{\pi} \ge 1$. Let a gain-optimal policy be

$$\pi^{*} \in \arg\max_{\pi \in \Pi} \frac{v^{\pi}}{c^{\pi}}.$$

If $v^{*} = v^{\pi^{*}}$ and $c^{*} = c^{\pi^{*}}$, let

$$\max_{\pi \in \Pi} \frac{v^{\pi}}{c^{\pi}} = \frac{v^{*}}{c^{*}} = \rho^{*};$$

then

$$v^{*} + \rho^{*}(-c^{*}) = 0.$$

Now, assume there existed some policy $\hat{\pi}$ with corresponding $\hat{v}$ and $\hat{c}$ that had a better
fitness in the linear problem,

$$\hat{v} + \rho^{*}(-\hat{c}) > 0.$$

It must then follow (since all $c^{\pi}$ are positive) that

$$\hat{v} > \rho^{*} \hat{c},$$
$$\frac{\hat{v}}{\hat{c}} > \rho^{*},$$

which would contradict the optimality of $\pi^{*}$. ■


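This result has deep implications. Assume $\rho^*$ is known. Then, we would be interested
in solving the problem

$$\pi^{*} \in \arg\max_{\pi \in \Pi} v^{\pi} - \rho^{*} c^{\pi}$$
$$= \arg\max_{\pi \in \Pi} E\left[ \sum_{t=0}^{\infty} r(s_t, \pi(s_t)) \,\middle|\, s_0 = s_I \right] - \rho^{*} E\left[ \sum_{t=0}^{\infty} k(s_t, \pi(s_t)) \,\middle|\, s_0 = s_I \right]$$
$$= \arg\max_{\pi \in \Pi} E\left[ \sum_{t=0}^{\infty} \left( r(s_t, \pi(s_t)) - \rho^{*} k(s_t, \pi(s_t)) \right) \,\middle|\, s_0 = s_I \right],$$

which is equivalent to a single cumulative reward problem with rewards $r - \rho^{*} k$, where, as
discussed above, the rewards $r$ and costs $k$ are functions of $(s, a, s')$.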
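The equivalence in Lemma 6 can be checked numerically on a toy policy set. The four
(value, cost) pairs below are invented for illustration; the only property taken from the text
is that every cost is at least one.

```python
# Hypothetical (value, cost) pairs of four policies of a split task.
policies = {
    "pi_a": (4.0, 2.0),
    "pi_b": (9.0, 3.0),
    "pi_c": (5.0, 1.0),
    "pi_d": (-2.0, 1.5),
}


def gain(name):
    """Gain of a policy: its value/cost ratio."""
    v, c = policies[name]
    return v / c


def nudged_value(name, rho):
    """Value of a policy when every action cost is charged at rate rho."""
    v, c = policies[name]
    return v - rho * c


# Left-hand problem of Lemma 6: maximize the ratio.
best_by_gain = max(policies, key=gain)
rho_star = gain(best_by_gain)

# Right-hand problem of Lemma 6: maximize the linear combination at rho_star.
best_by_nudged_value = max(policies, key=lambda name: nudged_value(name, rho_star))
```

For any gain below `rho_star` the best nudged value is positive, and for any gain above it the
best nudged value is negative, which is what makes a search over the gain possible.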


4.2 Nudging. 


Naturally, $\rho^*$ corresponds to the optimal gain and is unknown beforehand. In order to
compute it, we propose separating the problem in two parts: finding by reinforcement
learning the optimal policy and its value for some fixed gain, and independently doing the
gain-update. Thus, value-learning becomes method-free, so any of the robust methods listed
in Section 3.1 can be used for this stage. The original problem can be then turned into a
sequence of MDPs, for a series of temporarily fixed $\rho_i$. Hence, while remaining within the
bounds of the generic Algorithm 1, we propose not to hurry to update $\rho$ after every step or
Q-update. Additionally, as a consequence of Lemma 6, the method comes with the same
solid termination condition of SSP algorithms: the current optimal policy $\pi^*$ is gain-optimal
if $v^{\pi^*} - \rho_i c^{\pi^*} = 0$.

This suggests the nudged version of the learning algorithm, Algorithm 2. The term
nudged comes from the understanding of $\rho$ as a measure of the punishment given to the
agent after each action in order to promote receiving the largest rewards as soon as possible.
The remaining problem is to describe a suitable $\rho$-update rule.


Algorithm 2 Nudged Learning
Set Bertsekas split
Initialize ($\pi$, $\rho$, and $H$ or $Q$)
repeat
    Set reward scheme to $(r - \rho k)$
    Solve by any RL method
    Update $\rho$
until $H^{\pi}(s_I) = Q(s_I, \pi^{*}(s_I)) = 0$
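The outer loop of Algorithm 2 can be sketched with two stand-ins that are assumptions of this
example and not the method of the paper: the "Solve by any RL method" step is replaced by a
brute-force search over an explicit table of hypothetical (value, cost) pairs, and the gain
update uses the simple rule $\rho \leftarrow v / c$ of the current best policy (a Dinkelbach-style
iteration) instead of the optimal nudging update.

```python
def solve_nudged(policies, rho):
    """Stand-in for the RL solver: best policy under rewards r - rho * k."""
    best = max(policies, key=lambda p: policies[p][0] - rho * policies[p][1])
    v, c = policies[best]
    return best, v, c, v - rho * c


def nudged_learning(policies, rho=0.0, tol=1e-9, max_iters=100):
    """Alternate between solving a fixed-gain task and updating the gain."""
    best = None
    for _ in range(max_iters):
        best, v, c, nudged = solve_nudged(policies, rho)
        if abs(nudged) <= tol:  # termination condition of Algorithm 2
            break
        rho = v / c             # placeholder update, NOT the optimal nudging rule
    return best, rho


# Hypothetical (value, cost) pairs, all with cost at least one.
policies = {
    "pi_a": (4.0, 2.0),
    "pi_b": (9.0, 3.0),
    "pi_c": (5.0, 1.0),
    "pi_d": (-2.0, 1.5),
}
best_policy, rho_final = nudged_learning(policies)
```

On this table the loop visits the gains 0, 3 and 5 before the nudged value of the best policy
reaches zero; each visit corresponds to one full call to the black-box solver.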
4.3 The w − l Space.


We will present a variation of the $w - l$ space, originally introduced by Uribe et al. (2011)
for the restricted setting of Markov decision process games, as the realm to describe $\rho$
uncertainty and to propose a method to reduce it optimally.

Under the assumptions stated above, the only additional requirement for the definition
of the $w - l$ space is a bound on the value of all policies:

Definition 7 (D) Let $D$ be a bound on unsigned, unnudged reward, such that

$$D \ge \max_{\pi \in \Pi} v^{\pi}$$

and

$$-D \le \min_{\pi \in \Pi} v^{\pi}.$$

Observe that $D$ is a (possibly very loose) bound on $\rho^*$. However, it can become tight
in the case of a task and gain-optimal policy in which termination from $s_I$ occurs in one
step with reward of magnitude $D$. Importantly, under our assumptions and Definition 7,
all policies $\pi \in \Pi$ will have finite real expected value $-D \le v^{\pi} \le D$ and finite positive cost
$c^{\pi} \ge 1$.


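4.3.1 The w − l Mapping.

We are now ready to propose the simple mapping of a policy $\pi \in \Pi$, with value $v^{\pi}$ and cost
$c^{\pi}$, to the 2-dimensional $w - l$ space using the transformation equations:

$$w^{\pi} = \frac{D + v^{\pi}}{2 c^{\pi}}, \qquad l^{\pi} = \frac{D - v^{\pi}}{2 c^{\pi}}. \qquad (13)$$

The following properties of this transformation can be easily checked from our assumptions:


Proposition 8 (Properties of the $w - l$ space.) For all policies $\pi \in \Pi$,

1. $0 \le w^{\pi} \le D$; $0 \le l^{\pi} \le D$.

2. $w^{\pi} + l^{\pi} = \frac{D}{c^{\pi}} \le D$.

3. If $v^{\pi} = D$, then $l^{\pi} = 0$.

4. If $v^{\pi} = -D$, then $w^{\pi} = 0$.

5. $\lim_{c^{\pi} \to \infty} (w^{\pi}, l^{\pi}) = (0, 0)$.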


As a direct consequence of Proposition 8, the whole policy space, which has
$|\mathcal{A}_s|^{|\mathcal{S}|}$ elements, with $|\mathcal{A}_s|$ equal to the average number of actions per state, is mapped
into a cloud of points in the $w - l$ space, bounded by a triangle with vertices at the origin and
the points $(D, 0)$ and $(0, D)$. This is illustrated in Figure 3.

Figure 3: Mapping of a policy cloud to the $w - l$ space.
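The mapping of Equation (13) and the triangle bound can be checked on a small, invented cloud
of policies. The helper names `to_wl` and `in_triangle`, the bound `D = 10` and the
(value, cost) pairs are assumptions of this sketch.

```python
def to_wl(v, c, D):
    """Equation (13): map a policy of value v and cost c to its (w, l) point."""
    return (D + v) / (2.0 * c), (D - v) / (2.0 * c)


def in_triangle(w, l, D, eps=1e-12):
    """Check membership in the triangle with vertices (0, 0), (D, 0), (0, D)."""
    return w >= -eps and l >= -eps and w + l <= D + eps


D = 10.0
# Hypothetical (value, cost) pairs with |v| <= D and c >= 1.
cloud = [(4.0, 2.0), (9.0, 3.0), (5.0, 1.0), (-2.0, 1.5), (10.0, 1.0), (-10.0, 4.0)]
points = [to_wl(v, c, D) for v, c in cloud]
all_inside = all(in_triangle(w, l, D) for w, l in points)
```

The two extreme entries of the cloud land on the axes, as items 3 and 4 of Proposition 8
require: value `D` gives `l = 0` and value `-D` gives `w = 0`.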


4.3.2 Value and Cost in w − l.


Cross-multiplying the equations in (13), it is easy to find an expression for value in $w - l$,

$$v^{\pi} = D\, \frac{w^{\pi} - l^{\pi}}{w^{\pi} + l^{\pi}}. \qquad (14)$$

All of the policies of the same value (for instance $\hat{v}$) lie on a level set that is a line with
slope $\frac{D - \hat{v}}{D + \hat{v}}$ and intercept at the origin. Thus, as stated in Proposition 8, policies of value
$\pm D$ lie on the $w$ and $l$ axes and, further, policies of expected value 0 lie on the $w = l$ line.
Furthermore, geometrically, the value-optimal policies must subtend the smallest angle with
the $w$ axis and vertex at the origin. Figure 4 (left) shows the level sets of Equation (14).


Figure 4: Value (left) and cost (right) level sets in the $w - l$ space.

On the other hand, adding both Equations in (13), cost in the $w - l$ space is immediately
found as

$$c^{\pi} = \frac{D}{w^{\pi} + l^{\pi}}. \qquad (15)$$

This function also has line level sets, in this case all of slope $-1$. The $w + l = D$ edge
of the triangle corresponds to policies of expected cost one and, as stated in Proposition 8,
policies in the limit of infinite cost should lie on the origin. Figure 4 (right)
shows the cost level sets in the $w - l$ space. An interesting question is whether the origin
actually belongs to the $w - l$ space. Since it would correspond to policies of infinite expected
cost, that would contradict either the unichain assumption or the assumption that $s_I$ is a
recurrent state, so the origin is excluded from the $w - l$ space.
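As a worked check of Equations (14) and (15), the sum and the difference of the two components
in Equation (13) can be taken directly. The symbol $\hat{v}$ in the last line denotes an arbitrary
fixed value, used only to read off the slope of a value level set.

```latex
\begin{align*}
w^{\pi} + l^{\pi} &= \frac{(D + v^{\pi}) + (D - v^{\pi})}{2 c^{\pi}} = \frac{D}{c^{\pi}}
  &&\Rightarrow\quad c^{\pi} = \frac{D}{w^{\pi} + l^{\pi}}, \\
w^{\pi} - l^{\pi} &= \frac{(D + v^{\pi}) - (D - v^{\pi})}{2 c^{\pi}} = \frac{v^{\pi}}{c^{\pi}}
  &&\Rightarrow\quad v^{\pi} = c^{\pi}\,(w^{\pi} - l^{\pi}) = D\,\frac{w^{\pi} - l^{\pi}}{w^{\pi} + l^{\pi}}, \\
\hat{v}\,(w + l) &= D\,(w - l)
  &&\Rightarrow\quad l = \frac{D - \hat{v}}{D + \hat{v}}\, w .
\end{align*}
```

The second line also shows that the gain $v^{\pi} / c^{\pi}$ of a policy is simply $w^{\pi} - l^{\pi}$.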

The following result derives from the fact that, even though the value and cost
expressions in Equations (14) and (15) are not convex functions, both value and cost level sets
are lines, each of which divides the triangle in two polygons:

Lemma 9 (Value and cost-optimal policies in $w - l$) The policies of maximum or
minimum value and cost map in the $w - l$ space to vertices of the convex hull of the point cloud.


Proof All cases are proved using the same argument by contradiction. Consider for
instance a policy of maximum value $\pi^*$, with value $v^*$. The $v^*$ level set line splits the $w - l$
space triangle in two, with all points corresponding to higher value below and to lower value
above the level set. If the mapping $(w^{\pi^*}, l^{\pi^*})$ is an interior point of the convex hull of the
policy cloud, then some points on an edge of the convex hull, and consequently at least one
of its vertices, must lie on the region of value higher than $v^*$. Since all vertices of this cloud
of points correspond to actual policies of the task, there is at least one policy with higher
value than $\pi^*$, which contradicts its optimality. The same argument extends to the cases
of minimum value, and maximum and minimum cost. ■


4.3.3 Nudged Value in w − l.

Recall, from the fractional programming Lemma 6, that for an appropriate $\rho^*$ these two
problems are optimized by the same policies:

$$\arg\max_{\pi \in \Pi} \frac{v^{\pi}}{c^{\pi}} = \arg\max_{\pi \in \Pi} v^{\pi} + \rho^{*}(-c^{\pi}). \qquad (16)$$

By substituting on the left hand side problem in Equation (16) the expressions for $w^{\pi}$
and $l^{\pi}$, the original average-reward semi-Markov problem becomes in the $w - l$ space the
simple linear problem

$$\arg\max_{\pi \in \Pi} \frac{v^{\pi}}{c^{\pi}} = \arg\max_{\pi \in \Pi} w^{\pi} - l^{\pi}. \qquad (17)$$

Figure 5 illustrates the slope-one level sets of this problem in the space. Observe that,
predictably, the upper bound of these level sets in the triangle corresponds to the vertex at
$(D, 0)$, which, as discussed above, in fact would correspond to a policy with value $D$ and
unity cost, that is, a policy that would receive the highest possible reinforcement from the
recurrent state and would return to it in one step with probability one.

Figure 5: Level sets of the linear correspondence of an average-reward SMDP in the $w - l$
space.

Conversely, for some $\rho_i$ (not necessarily the optimal $\rho^*$), the problem on the right hand
side of Equation (16) becomes, in the $w - l$ space,

$$\arg\max_{\pi \in \Pi} v^{\pi} - \rho_i c^{\pi} = \arg\max_{\pi \in \Pi} D\, \frac{w^{\pi} - l^{\pi} - \rho_i}{w^{\pi} + l^{\pi}},$$

where we opt not to drop the constant $D$. We refer to the nudged value of a policy, $h^{\pi}_{\rho_i}$, as

$$h^{\pi}_{\rho_i} = D\, \frac{w^{\pi} - l^{\pi} - \rho_i}{w^{\pi} + l^{\pi}}.$$

All policies $\pi$ sharing the same nudged value $h_{\rho_i}$ lie on the level set

$$l = \frac{D - h_{\rho_i}}{D + h_{\rho_i}}\, w - \frac{D}{D + h_{\rho_i}}\, \rho_i, \qquad (18)$$

which, again, is a line on the $w - l$ space whose slope, further, depends only on the common
nudged value, and not on $\rho_i$. Thus, for instance, for any $\rho_i$, the level set corresponding to
policies with zero nudged value, such as those of interest for the termination condition of
the fractional programming Lemma 6 and Algorithm 2, will have unity slope.

There is a further remarkable property of the line level sets corresponding to policies of
the same nudged value, summarized in the following result:


Lemma 10 (Intersection of nudged value level sets) For a given nudging $\rho_i$, the level
sets of all possible $h_{\rho_i}$ share a common intersection point, $\left( \frac{\rho_i}{2}, -\frac{\rho_i}{2} \right)$.

Proof This result is proved through straightforward algebra. Consider, for a fixed $\rho_i$, the
line level sets corresponding to two nudged values, $h_{\rho_i}$ and $\hat{h}_{\rho_i}$. Making the right hand sides
of their expressions in the form of Equation (18) equal, and solving to find the $w$ component
of their intersection yields:

$$\frac{D - h_{\rho_i}}{D + h_{\rho_i}}\, w - \frac{D}{D + h_{\rho_i}}\, \rho_i = \frac{D - \hat{h}_{\rho_i}}{D + \hat{h}_{\rho_i}}\, w - \frac{D}{D + \hat{h}_{\rho_i}}\, \rho_i,$$
$$2D (\hat{h}_{\rho_i} - h_{\rho_i})\, w = D (\hat{h}_{\rho_i} - h_{\rho_i})\, \rho_i,$$
$$w = \frac{\rho_i}{2}.$$

Finally, replacing this in the level set for $h_{\rho_i}$,

$$l = \frac{D - h_{\rho_i}}{D + h_{\rho_i}}\, \frac{\rho_i}{2} - \frac{D}{D + h_{\rho_i}}\, \rho_i,$$
$$l = \frac{-D - h_{\rho_i}}{D + h_{\rho_i}}\, \frac{\rho_i}{2} = -\frac{\rho_i}{2}.$$


Thus, for a set $\rho_i$, all nudged level sets comprise what is called a pencil of lines, a
parametric set of lines with a unique common intersection point. Figure 6 (right) shows an
instance of such a pencil.



Figure 6: Nudged value in the w — I space. Left, value level sets as a pencil of lines. Right, 
solution to both problems with p* in Lemma 


The termination condition states that the nudged and average-reward problems share 
the same solution space for p* such that the nudged value of the optimal policy is 0. Figure 
[^(left) illustrates this case: the same policy in the example cloud simultaneously solves the 
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linear problem with level sets of slope one (corresponding to the original average-reward 
task in Equation 17), and the nudged problem with zero-nudged value. 


4.4 Minimizing $\rho$-uncertainty.

We have now all the elements to start deriving an algorithm for iteratively enclosing $\rho^*$
quickly in the smallest neighborhood possible. Observe that from the outset, since we are
assuming that policies with non-negative gain exist, the bounds on the optimal gain are the
largest possible,

$$0 \le \rho^* \le D . \tag{19}$$

Geometrically, this is equivalent to the fact that the vertex of the pencil of lines for $\rho^*$ can
be, a priori, anywhere on the segment of the $w = -l$ line between $(0, 0)$, if the optimal policy
has zero gain, and $\left(\frac{D}{2}, -\frac{D}{2}\right)$, if the optimal policy receives reinforcement $D$ and terminates
in a unity-cost step (having gain $D$). For our main result, we will propose a way to reduce
the best known bounds on $\rho^*$ as much as possible every time a new $\rho_i$ is computed.


4.4.1 Enclosing Triangles, Left and Right $\rho$-uncertainty.

In this section we introduce the notion of an enclosing triangle. The method introduced 
below to reduce gain uncertainty exploits the properties of reducing enclosing triangles to 
optimally reduce uncertainty between iterations. 


Definition 11 (Enclosing triangle) A triangle in the $w-l$ space with vertices $A = (w_A, l_A)$,
$B = (w_B, l_B)$, and $C = (w_C, l_C)$ is an enclosing triangle if it is known to contain
the mapping of the gain-optimal policy and, additionally,

1. $w_B \ge w_A$; $l_B \ge l_A$.

2. $\frac{l_B - l_A}{w_B - w_A} = 1$.

3. $w_A \ge l_A$; $w_B \ge l_B$; $w_C \ge l_C$.

4. $P = \frac{w_B - l_B}{2} = \frac{w_A - l_A}{2} \le \frac{w_C - l_C}{2} = Q$.

5. $0 \le \frac{l_C - l_A}{w_C - w_A} \le 1$.

6. $\left| \frac{l_C - l_B}{w_C - w_B} \right| \ge 1$.


Figure 7 illustrates the geometry of enclosing triangles as defined. The first two conditions
in Definition 11 ensure that the point $B$ is above $A$ and the slope of the line that joins
them is unity. The third condition places all three points in the part of the $w-l$ space on
or below the $w = l$ line, corresponding to policies with non-negative gain.

In the fourth condition, $P$ is defined as the $w$-component of the intersection of the line
with slope one that joins points $A$ and $B$, and the $w = -l$ line; and $Q$ as the $w$ value of the
intersection of the line with slope one that crosses point $C$, and $w = -l$. Requiring $P \le Q$
is equivalent to forcing $C$ to be below the line that joins $A$ and $B$.

Figure 7: Illustration of the definition of an enclosing triangle. The segment joining vertices
$A$ and $B$ belongs to a line with slope one. Vertex $C$ can lie anywhere on the light
grey triangle, including its edges. $P$ and $Q$ are the ($w$ components of the) slope-one
projection of the vertices to the $w = -l$ line.

The fifth and sixth conditions confine the possible location of $C$ to the triangle with
vertices $A$, $B$, and the intersection of the lines that cross $A$ with slope zero and $B$ with
slope minus one. This triangle is pictured with thick dashed lines in Figure 7.


Remark 12 (Degenerate enclosing triangles) Observe that, in the definition of an enclosing
triangle, some degenerate or indeterminate cases are possible. First, if $A$ and $B$ are
concurrent, then $C$ must be concurrent to them (so the “triangle” is in fact a point) and
some terms in the slope conditions (2, 5, 6) in Definition 11 become indeterminate. Alternatively,
if $P = Q$ in condition 4, then $A$, $B$, and $C$ must be collinear. We admit both of these
degenerate cases as valid enclosing triangles since, as is discussed below, they correspond to
instances in which the solution to the underlying average-reward task has been found.


Since we assume that positive-gain policies, and thus positive-gain optimal policies, exist,
direct application of the definition of an enclosing triangle leads to the following Proposition:

Proposition 13 (Initial enclosing triangle) $A = (0, 0)$, $B = \left(\frac{D}{2}, \frac{D}{2}\right)$, $C = (D, 0)$ is
an enclosing triangle.
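The geometric part of Definition 11 is easy to test mechanically. The Python sketch below is an illustration under stated assumptions: the function name and tolerance are hypothetical, the slope conditions are written without divisions so that degenerate triangles do not raise errors, condition 5 is checked assuming $C$ is not to the left of $A$, and condition 6 is read as requiring the absolute slope of $BC$ to be at least one, which is the reading consistent with $C$ lying between the slope-zero line through $A$ and the slope-minus-one line through $B$. Whether the triangle actually contains the gain-optimal policy is not something geometry alone can verify, so it is not tested.

```python
def is_enclosing_triangle(A, B, C, tol=1e-12):
    """Check the geometric conditions 1 to 6 of Definition 11 for vertices given as (w, l)."""
    (wA, lA), (wB, lB), (wC, lC) = A, B, C
    # 1. B is not below or to the left of A.
    if wB < wA - tol or lB < lA - tol:
        return False
    # 2. AB has slope one (no division, so A == B is tolerated).
    if abs((lB - lA) - (wB - wA)) > tol:
        return False
    # 3. All vertices lie on or below the w = l line (non-negative gain).
    if wA < lA - tol or wB < lB - tol or wC < lC - tol:
        return False
    # 4. P <= Q: C is on or below the line through A and B.
    P, Q = (wB - lB) / 2.0, (wC - lC) / 2.0
    if P > Q + tol:
        return False
    # 5. Slope of AC in [0, 1], written as 0 <= dl <= dw.
    dw, dl = wC - wA, lC - lA
    if dl < -tol or dl > dw + tol:
        return False
    # 6. Absolute slope of BC at least one, written as |dl| >= |dw|.
    if abs(lC - lB) < abs(wC - wB) - tol:
        return False
    return True


D = 1.0
initial_ok = is_enclosing_triangle((0.0, 0.0), (D / 2.0, D / 2.0), (D, 0.0))
```

With $D = 1$, the triangle of Proposition 13 passes all six checks, while a candidate with $C$ above the $w = l$ line is rejected by condition 3.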

In order to understand the reduction of uncertainty after solving the reinforcement
learning task for a fixed gain, consider the geometry of setting $\rho_i$ to some value within the
uncertainty range of that initial enclosing triangle, for example $\rho_i = \frac{D}{4}$.

If, after solving the reinforcement learning problem with rewards $r(s, a, s') - \rho_i k(s, a, s')$
the value of the initial state $s_I$ for the resulting optimal policy were $v_i^* = 0$, then the
semi-Markov task would be solved, since the termination condition would have been met.
Observe that this would not imply, nor require, any knowledge of the value and cost, or
conversely, the $w$ and $l$ coordinates of that optimal policy. However, the complete solution
to the task would be known: the optimal gain would be $\rho_i$, and the optimal policy and its
gain-adjusted value would be, respectively, the policy and (cumulative) value just found.
In the $w-l$ space, the only knowledge required to conclude this is that the coordinates of
the optimal policy to both problems would lie somewhere inside the $w-l$ space on the line
with slope 1 that crosses the point $\left(\frac{D}{8}, -\frac{D}{8}\right)$.

On the other hand, if the optimal value after setting $\rho_i = \frac{D}{4}$ and solving were some
positive value $v_i^* > 0$ (for example $v_i^* = \frac{D}{3}$), the situation would be as shown in Figure 8.
The termination condition would not have been met, but the values of $\rho_i$ and $v_i^*$ would still
provide a wealth of exploitable information.



Figure 8: Geometry of the solution for a fixed $\rho_i$ in the initial enclosing triangle. The
nudged optimal policy maps somewhere on the $ST$ segment, so the gain-optimal
policy must map somewhere on the area shaded “2”, and all points in regions “1”
and “3” can be discarded.

Setting $\rho_i = \frac{D}{4}$ causes the pencil of value level set lines to have common point $\left(\frac{D}{8}, -\frac{D}{8}\right)$.
The optimal value $v_i^* = \frac{D}{3}$ corresponds to a level set of slope 0.5, pictured from that point
in thick black. The nudged-value optimal policy in $w-l$, then, must lie somewhere on the
segment that joins the crossings of this line with the lines $l = 0$ (point $S$) and $w + l = D$
(point $T$). This effectively divides the space in three regions with different properties,
shaded and labelled from “1” to “3” in Figure 8.

First, no policies of the task can map to points in the triangle “3”, since they would have
higher $\rho_i$-nudged value, contradicting the optimality of the policy found. Second, policies
with coordinates in the region labelled “1” would have lower gain than all policies in the
$ST$ segment, which is known to contain at least one policy, the nudged optimizer. Thus,
the gain-optimal policy must map in the $w-l$ space to some point in region “2”, although
not necessarily on the $ST$ segment. Moreover, clearly “2” is itself a new enclosing triangle,
with vertices $A' = S$, $C' = T$, and $B'$ on the intersection of the $w + l = D$ line and the
slope-one line that crosses the point $S$.

As a direct consequence of being able to discard regions “1” and “3”, the uncertainty
range for (one-half of) the optimal gain reduces from $Q - P$, that is, the initial $\frac{D}{2}$ that
halves the range in Equation (19), to the difference of the 1-projections of the vertices of
triangle “2” to the $w = -l$ line, $Q' - P'$. For the values considered in the example, total
uncertainty reduces approximately fivefold, from $D$ to $\frac{5D}{24}$.

Thus, running the reinforcement learning algorithm to solve the nudged problem, with
$\rho_i$ within the bounds of the initial enclosing triangle allows, first, the determination of a new,
smaller enclosing triangle and second, a corresponding reduction on the gain uncertainty.
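The arithmetic of this first reduction can be reproduced with exact fractions. The sketch below is illustrative only: the function name is hypothetical, and the values $\rho_i = D/4$ and $v_i^* = D/3$ are assumptions chosen here because they give a level set of slope 0.5 and an approximately fivefold reduction, as in the worked example. It computes the points $S$, $B'$ and $T$ from Equation (18) and the resulting bounds on the gain, $w - l$, at $A' = S$ and $C' = T$.

```python
from fractions import Fraction


def first_reduction(D, rho, v_star):
    """New enclosing triangle (A', B', C') after one nudged solve with v_star > 0."""
    slope = (D - v_star) / (D + v_star)      # level-set slope, Equation (18)
    intercept = -D * rho / (D + v_star)      # level-set intercept, Equation (18)
    # S: crossing of the level set with the l = 0 line.
    S = (-intercept / slope, Fraction(0))
    # T: crossing of the level set with the w + l = D line.
    w_T = (D - intercept) / (1 + slope)
    T = (w_T, D - w_T)
    # B': crossing of the w + l = D line with the slope-one line through S.
    w_B = (D + S[0]) / 2
    B_new = (w_B, D - w_B)
    return S, B_new, T


D = Fraction(1)
A_new, B_new, C_new = first_reduction(D, rho=Fraction(1, 4), v_star=Fraction(1, 3))
gain_lower = A_new[0] - A_new[1]      # w - l at A'
gain_upper = C_new[0] - C_new[1]      # w - l at C'
uncertainty = gain_upper - gain_lower
```

Under these assumed values the optimal gain is bracketed between $\frac{3D}{8}$ and $\frac{7D}{12}$, an interval of length $\frac{5D}{24}$.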



Figure 9: Possible outcomes after solving a nudged task within the bounds of an arbitrary 
enclosing triangle. In all cases either the uncertainty range vanishes to a point 
(top left and bottom right) or a line segment (bottom left), and the problem 
is solved; or a smaller enclosing triangle results (top middle and right, bottom 
middle). 


Both of these observations are valid for the general case, as illustrated in Figure 9.
Consider an arbitrary enclosing triangle with vertices $A$, $B$, and $C$. Some $\rho_i$ is set within
the limits determined by the triangle and the resulting nudged problem is solved, yielding
a nudged-optimal policy $\pi_i^*$, with value $v_i^*$.

In the $w-l$ space, in all cases, the line from the point $\left(\frac{\rho_i}{2}, -\frac{\rho_i}{2}\right)$, labelled simply $0.5\rho_i$
in the plots, has slope $\frac{D - v_i^*}{D + v_i^*}$ and intercepts the $AC$ and $AB$ or $BC$ segments. The three
possible degenerate cases are reduction to a point, if the level set from $\rho_i$ crosses the points
$A$ or $C$ (top left and bottom right plots in Figure 9, respectively), and the situation with
$v_i^* = 0$, where the problem is solved and, geometrically, the uncertainty area reduces from
an enclosing triangle to a line segment. In these three instances, the method stops because
the new gain uncertainty reduces to $P' - Q' = 0$.

If $v_i^*$ is negative, the level set can intercept $AB$ (top middle plot in Figure 9) or $BC$
(top right plot). In the first case, the new enclosing triangle, shaded dark grey, has the same
$A' = A$ vertex, $B'$ is the intercept with $AB$, and $C'$ is the intercept with $AC$. Thus, $P' = P$
and $Q' < Q$, for a strict reduction of the uncertainty. Points in the light grey region to the
right of this new triangle cannot correspond to any policies, because they would have solved
the $\rho_i$-nudged problem instead, so the gain-optimal policy must be inside the triangle with
vertices $A'$, $B'$, and $C'$. In the second case, the level set intercept with $AC$ again becomes
$C'$, its intercept with $BC$ is the new $B'$, and the slope-1 projection of $B'$ to $AC$ is the new
$A'$. Again, points in the light grey triangle with vertices $B'$, $C'$, $C$ cannot contain mapped
policies, or that would contradict the nudged optimality of $\pi_i^*$ and, furthermore, points
in the trapezoid to the left of the new enclosing triangle cannot contain the gain-optimal
policy, since any policies on $C'B'$, which includes $\pi_i^*$, have larger gain. The new $Q'$ is the
1-projection of $C'$ to $w = -l$ and it is smaller than, not only the old $Q$, but also $\frac{\rho_i}{2}$. The
new $P'$ is the 1-projection of $B'$ (or $A'$) to $w = -l$ and it is larger than $P$, resulting in a
strict reduction of the gain uncertainty.

If $v_i^*$ is positive (bottom middle plot in Figure 9), the new $C'$ is the $BC$ intercept of
the level set, $A'$ is its $AC$ intercept, and $B'$ is the 1-projection of $A'$ to $BC$. Otherwise, the
same arguments of the preceding case apply, with $P'$ larger than $P$ and $\frac{\rho_i}{2}$, and $Q'$ smaller
than $Q$.

These observations are formalized in the following result: 


Lemma 14 (Reduction of enclosing triangles) Let the points $A$, $B$, and $C$ define an
enclosing triangle. Setting $w_B - l_B \le \rho_i \le w_C - l_C$ and solving the resulting task with rewards
$(r - \rho_i k)$ to find $v_i^*$ results in a strictly smaller, possibly degenerate enclosing triangle.


Proof The preceding discussion and Figure 9 show how to build the triangle in each case
and why it must contain the mapping of the optimal policy. In Appendix A we show that
the resulting triangle is indeed enclosing, that is, that it satisfies the conditions in Definition
11 and that it is strictly smaller than the original enclosing triangle. ■


4.4.2 An additional termination condition 

Another very important geometrical feature arises in the case when the same policy is
nudged-optimal for two different nudges, $\rho_1$ and $\rho_2$, with optimal nudged values of different
signs. This is pictured in Figure 10. Assume that the gain is set to some value $\rho_1$ and the
nudged task is solved, resulting in some nudged-optimal policy $\pi_1^*$ with positive value (of
the recurrent state $s_I$) $v_1^* > 0$. The geometry of this is shown in the top left plot. As is the
case with Lemma 14 and Figure 9, the region shaded light grey cannot contain the $w-l$
mapping of any policies without contradicting the optimality of $\pi_1^*$, while the dark grey area
represents the next enclosing triangle. If, next, the gain is set to $\rho_2$ and the same policy is
found to be optimal, $\pi_2^* = \pi_1^*$, but now with $v_2^* < 0$, not only is the geometry as shown in
the top right plot, again with no policies mapping to the light grey area, but remarkably
we can also conclude that the optimizer of both cases is also the gain-optimal policy of the
global task. Indeed, the bottom plot in Figure 10 shows in light grey the union of the areas
that cannot contain policy mappings; in principle this would reduce the uncertainty region
to the dark grey enclosing triangle. However, $\pi_1^* = \pi_2^* = \pi^*$ is known to map to a point in
each of the two solid black line segments, so it must be on their intersection. Since this is
the extreme vertex of the new enclosing triangle in the direction of increase of the level sets
of the average-reward problem and a policy, namely $\pi^*$, is known to reside there, it then
must solve the task.



Figure 10: Top: Geometry of solving a nudged task for two different set gains $\rho_1$ and $\rho_2$.
Bottom: If the same policy $\pi^*$ is the optimizer in both cases, it lies on the
intersection of the solid black lines and solves the average-reward task.


Observe that this argument holds for the opposite change of sign, from $v_1^* < 0$ to $v_2^* > 0$.
The observation can thus be generalized:

Lemma 15 (Termination by zero crossing) If the same policy $\pi_i^* = \pi_{i+1}^*$ is optimal
for two consecutive nudges $\rho_i$ and $\rho_{i+1}$ and the value of the recurrent state changes signs
between them, the policy is gain-optimal.


Remark 16 A similar observation was made by Bertsekas (1998), who observed that the 
optimal value of a reference state in SSPs is a concave, monotonically decreasing, piecewise 
linear function of gain. However, as a consequence of value-drift during learning, unlike all 
of the methods in the literature only optimal nudging can rely for termination on the first 
zero-crossing. 


The final step remaining to our formulation is, at the start of an iteration of Algorithm 3,
deciding the set gain, which geometrically corresponds to choosing the location of the vertex
of the next pencil of lines. The next section shows how to do this optimally.


4.4.3 MINMAX Uncertainty 

We have already discussed the implications of setting the current gain/nudging to some
value $\rho_i$ (for which upper and lower bounds are known), and then solving the resulting
cumulative reward task. The problem remains finding a good way to update $\rho_i$. It turns
out that the updates can be done optimally, in a minmax sense.

We start from an arbitrary enclosing triangle with vertices $A$, $B$, and $C$. By definition
of enclosing triangle, the slope of $AB$ is unity. We will refer to the slope of the $BC$ segment
as $m_\beta$ and that of $AC$ as $m_\gamma$. For notational simplicity, we will refer to the projection to
the $w = -l$ line, in the direction of some slope $m_\zeta$, of a point $X$ with coordinates $(w_X, l_X)$,
as the “$\zeta$ projection of $X$”, $X_\zeta$. This kind of projection has the simple general form

$$X_\zeta = \frac{m_\zeta w_X - l_X}{m_\zeta + 1} .$$

Naturally, attempting a $-1$-projection leads to an indetermination. On the other hand, if
$m_\zeta$ is infinite, then $X_\zeta = w_X$.
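The projection and its two special cases translate directly into code. The following Python sketch is a minimal illustration; the function name `project` and the decision to raise an error for the indeterminate direction are assumptions made here.

```python
import math


def project(X, m):
    """w-component of the projection of X = (w, l) onto the w = -l line along slope m."""
    w, l = X
    if math.isinf(m):
        return w                      # vertical direction: the projection keeps w
    if m == -1:
        raise ValueError("a -1-projection is parallel to the w = -l line")
    return (m * w - l) / (m + 1.0)


C_1 = project((1.0, 0.0), 1.0)        # 1-projection of C = (D, 0) with D = 1
B_gamma = project((0.5, 0.5), 0.0)    # projection of B = (D/2, D/2) along slope zero
```

For the initial enclosing triangle with $D = 1$, the 1-projection of $C$ is $\frac{D}{2}$ and the projection of $B$ along the slope of $AC$ (zero) is $-\frac{D}{2}$.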

An enclosing triangle with vertices $A = (0, 0)$, $B = \left(\frac{D}{2}, \frac{D}{2}\right)$ and $C = (D, 0)$, then, has a
$\rho$-uncertainty of the form

$$P \le \frac{\rho^*}{2} \le Q ,$$

$$A_1 = B_1 \le \frac{\rho^*}{2} \le C_1 ,$$

$$w_A - l_A = w_B - l_B \le \rho^* \le w_C - l_C .$$

Assume $\rho_x$ is set to some value inside this uncertainty region. The goal is to find $\rho_i$ as
the best location for $\rho_x$. Solving the $\rho_x$-nudged problem, that is, the cumulative task with
rewards $(r - \rho_x k)$, the initial state $s_I$ has optimal nudged value $v_x^*$. Disregarding the cases
in which this leads to immediate termination, the resulting geometry of the problem would
be similar to that shown in the top middle, top right, and bottom middle plots in Figure 9.
In all three cases, the resulting reduction in uncertainty is of the form

$$P' = B'_1 \le \frac{\rho^*}{2} \le C'_1 = Q' .$$

We call the resulting, reduced uncertainty for both cases with $v_x^* < 0$ (Figure 9, top middle
and top right) left uncertainty. It has the following, convenient features:


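Lemma 17 (Left Uncertainty) For any enclosing triangle with vertices $A$, $B$, $C$:

1. For any $A_1 = B_1 \le \frac{\rho_x}{2} \le C_1$, the maximum possible left uncertainty, $u_l^*$, occurs when
the line with slope $\frac{D - v_x^*}{D + v_x^*}$ from $\left(\frac{\rho_x}{2}, -\frac{\rho_x}{2}\right)$ intercepts $AB$ and $BC$ at point $B$.

2. When this is the case, the maximum left uncertainty is

$$u_l^* = \frac{\left(\frac{\rho_x}{2} - B_1\right)\left(C_\gamma - B_\gamma\right)}{\frac{\rho_x}{2} - B_\gamma} . \tag{20}$$

3. This expression is a monotonically increasing function of $\rho_x$.

4. The minimum value of this function is zero, and it is attained when $\frac{\rho_x}{2} = B_1$.


The proof of Lemma 17 is presented in Appendix B.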
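As a sanity check on the closed form of the maximum left uncertainty, it can be compared against a direct geometric construction. The Python sketch below is illustrative and makes several assumptions: the function names are hypothetical, `x` stands for $\rho_x/2$, the closed form used is $(x - B_1)(C_\gamma - B_\gamma)/(x - B_\gamma)$, and the triangle is non-degenerate with $AC$ not vertical. The geometric version intersects the line from $(x, -x)$ through $B$ with the line $AC$ and subtracts the 1-projections.

```python
def left_uncertainty(x, A, B, C):
    """Closed-form maximum left uncertainty, Equation (20), with x = rho_x / 2."""
    (wA, lA), (wB, lB), (wC, lC) = A, B, C
    m_gamma = (lC - lA) / (wC - wA)                      # slope of AC
    B1 = (wB - lB) / 2.0                                 # 1-projection of B
    B_gamma = (m_gamma * wB - lB) / (m_gamma + 1.0)      # gamma-projection of B
    C_gamma = (m_gamma * wC - lC) / (m_gamma + 1.0)      # gamma-projection of C
    return (x - B1) * (C_gamma - B_gamma) / (x - B_gamma)


def left_uncertainty_geometric(x, A, B, C):
    """Same quantity from the geometry: the line from (x, -x) through B cuts AC at C'."""
    (wA, lA), (wB, lB), (wC, lC) = A, B, C
    dw1, dl1 = wB - x, lB + x            # direction from O = (x, -x) to B
    dw2, dl2 = wC - wA, lC - lA          # direction from A to C
    # Solve O + t * d1 = A + s * d2 for t with Cramer's rule.
    det = dw1 * (-dl2) - (-dw2) * dl1
    t = ((wA - x) * (-dl2) - (-dw2) * (lA + x)) / det
    w_c, l_c = x + t * dw1, -x + t * dl1
    return (w_c - l_c) / 2.0 - (wB - lB) / 2.0           # C'_1 - B_1
```

Both functions agree on the initial enclosing triangle and on a triangle whose $AC$ side has non-zero slope, which supports the closed form as written here.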

The maximum left uncertainty for some $\rho_x$ in an enclosing triangle, described in Equation (20),
is a conic section, which by rearrangement of its terms can also be represented
using the homogeneous form

$$\begin{pmatrix} \frac{\rho_x}{2} & u_l^* & 1 \end{pmatrix}
\begin{pmatrix}
0 & 1 & (B_\gamma - C_\gamma) \\
1 & 0 & -B_\gamma \\
(B_\gamma - C_\gamma) & -B_\gamma & -2 B_1 (B_\gamma - C_\gamma)
\end{pmatrix}
\begin{pmatrix} \frac{\rho_x}{2} \\ u_l^* \\ 1 \end{pmatrix} = 0 . \tag{21}$$

Conversely to left uncertainty, we call the resulting new uncertainty for $v_x^* > 0$ (Figure 9,
bottom middle) right uncertainty. The derivation and optimization of right uncertainty
as a function of $\rho_x$ is considerably more intricate than left uncertainty. The following result
summarizes its features.


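Lemma 18 (Right Uncertainty) For any enclosing triangle with vertices $A$, $B$, $C$:

1. For any $A_1 = B_1 \le \frac{\rho_x}{2} \le C_1$, the maximum possible right uncertainty, $u_r^*$, is

$$u_r^* = \frac{2s\sqrt{abcd\left(\frac{\rho_x}{2} - C_\gamma\right)\left(\frac{\rho_x}{2} - C_\beta\right)} + ad\left(\frac{\rho_x}{2} - C_\gamma\right) + bc\left(\frac{\rho_x}{2} - C_\beta\right)}{e} , \tag{22}$$

with

$$\begin{aligned}
s &= \operatorname{sign}(m_\beta - m_\gamma) , \\
a &= (1 - m_\beta) , \\
b &= (1 + m_\beta) , \\
c &= (1 - m_\gamma) , \\
d &= (1 + m_\gamma) , \\
e &= (ad - bc) = 2(m_\gamma - m_\beta) .
\end{aligned}$$

2. This is a monotonically decreasing function of $\rho_x$,

3. whose minimum value is zero, attained when $\frac{\rho_x}{2} = C_1$.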


The maximum right uncertainty for some $\rho_x$ in an enclosing triangle, described by
Equation (22), is also a conic section, with homogeneous form

$$\begin{pmatrix} \frac{\rho_x}{2} & u_r^* & 1 \end{pmatrix}
\begin{pmatrix}
e & -(bc + ad) & -C_1 e \\
-(bc + ad) & e & (C_\beta bc + C_\gamma ad) \\
-C_1 e & (C_\beta bc + C_\gamma ad) & C_1^2 e
\end{pmatrix}
\begin{pmatrix} \frac{\rho_x}{2} \\ u_r^* \\ 1 \end{pmatrix} = 0 . \tag{23}$$

The criterion when choosing $\rho_i$ must be to minimize the largest possible uncertainty,
left or right, for the next step. Given the features of both uncertainty functions, this is
straightforward:


Theorem 19 (minmax Uncertainty)

$$\rho_i = \operatorname*{argmin}_{\rho_x} \max\left[u_l^*, u_r^*\right]$$

is a solution to

$$u_l^* = u_r^* .$$

Proof Since maximum left and right uncertainty are, respectively, monotonically increasing
and decreasing functions of $\rho_x$, and both have the same minimum value, zero, the maximum
between them is minimized when they are equal. ■


Thus, in principle, the problem of choosing $\rho_i$ in order to minimize the possible uncertainty
of the next iteration reduces to making the right hand sides of Equations (20)
and (22) equal and solving for $\rho_x$. Although finding an analytical expression for this solution
seems intractable, clearly any algorithm for root finding can be readily used here,
particularly since lower and upper bounds for the variable are known (i.e., $A_1 \le \frac{\rho_x}{2} \le C_1$).
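A plain bisection is enough to illustrate the root-finding route. The Python sketch below is not the conic-intersection procedure and rests on several assumptions: the function names are hypothetical, `x` stands for $\rho_x/2$, the constant in the denominator of the right uncertainty is taken as $e = 2(m_\gamma - m_\beta)$, the slope of $BC$ is assumed finite, and the products $d\,(x - C_\gamma)$ and $b\,(x - C_\beta)$ are expanded so that $m_\beta = -1$ (as in the initial enclosing triangle) requires no division.

```python
import math


def uncertainties(x, A, B, C):
    """Maximum left and right uncertainty, Equations (20) and (22), at x = rho_x / 2."""
    (wA, lA), (wB, lB), (wC, lC) = A, B, C
    m_gamma = (lC - lA) / (wC - wA)                  # slope of AC
    m_beta = (lC - lB) / (wC - wB)                   # slope of BC, assumed finite
    a, c = 1.0 - m_beta, 1.0 - m_gamma
    e = 2.0 * (m_gamma - m_beta)
    B1 = (wB - lB) / 2.0
    B_gamma = (m_gamma * wB - lB) / (m_gamma + 1.0)
    C_gamma = (m_gamma * wC - lC) / (m_gamma + 1.0)
    left = (x - B1) * (C_gamma - B_gamma) / (x - B_gamma)
    dp = (1.0 + m_gamma) * x - (m_gamma * wC - lC)   # d * (x - C_gamma)
    bq = (1.0 + m_beta) * x - (m_beta * wC - lC)     # b * (x - C_beta)
    s = math.copysign(1.0, m_beta - m_gamma)
    root = math.sqrt(max(a * c * dp * bq, 0.0))      # clamp tiny negative rounding errors
    right = (2.0 * s * root + a * dp + c * bq) / e
    return left, right


def minmax_gain(A, B, C, iterations=100):
    """Bisection on x in [B_1, C_1] for left == right; returns (rho, uncertainty)."""
    lo, hi = (B[0] - B[1]) / 2.0, (C[0] - C[1]) / 2.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        left, right = uncertainties(mid, A, B, C)
        if left < right:
            lo = mid          # left uncertainty is still the smaller one: move right
        else:
            hi = mid
    x = (lo + hi) / 2.0
    return 2.0 * x, max(uncertainties(x, A, B, C))


rho_first, worst_case = minmax_gain((0.0, 0.0), (0.5, 0.5), (1.0, 0.0))
```

The bisection relies only on the monotonicity stated in Lemmas 17 and 18, so any bracketing root finder could be substituted.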

However, even this is not necessary. Since both maximum left and right uncertainty
are conic sections, and the homogeneous form for both is known, their intersection
is straightforward to find in $O(1)$ time, following a process described in detail by Perwass
(2008). This method only requires solving a $3 \times 3$ eigenproblem, so the time required to
perform the computation of $\rho_i$ from the $w-l$ coordinates of the vertices of the current
enclosing triangle is negligible. For completeness, the intersection process is presented in
Appendix D, below.

Algorithm 3 summarizes the optimal nudging approach.

Algorithm 3 Optimal Nudging
  Set Bertsekas split
  Initialize ($\pi$ and $H$ or $Q$)
  Estimate $D$
  Initialize ($A = (0, 0)$, $B = \left(\frac{D}{2}, \frac{D}{2}\right)$, $C = (D, 0)$)
  repeat
    Compute $\rho_i$ (conic intersection)
    Set reward scheme to $(r - \rho_i k)$
    Solve by any RL method
    Compute from $v_i^*$ the coordinates of the new enclosing triangle
  until Zero-crossing termination or $H^{\pi}(s_I) = 0$


In the following two sections we will discuss the computational complexity of this method
and present some experiments of its operation.

5. Complexity of Optimal Nudging. 

In this section, we are interested in finding bounds on the number of calls to the “black box”
reinforcement learning solver inside the loop in Algorithm 3. It is easy to see that therein
lies the bulk of computation, since the other steps only involve geometric and algebraic
computations that can be done in constant, negligible time. Moreover, the type of reinforcement
learning performed (dynamic programming, model-based or model-free) and the
specific algorithm used will have their own complexity bounds and convergence guarantees
that are, in principle, transparent to optimal nudging.

In order to study the number of calls to reinforcement learning inside our algorithm, we 
will start by introducing a closely related variant and showing that it immediately provides 
a (possibly loose) bound for optimal nudging. 


Algorithm 4 $\alpha$-Nudging
  Set Bertsekas split
  Set $0 < \alpha < 1$
  Initialize ($\pi$ and $H$ or $Q$)
  Initialize ($A = (0, 0)$, $B = \left(\frac{D}{2}, \frac{D}{2}\right)$, $C = (D, 0)$)
  repeat
    Set $\frac{\rho_i}{2} = (1 - \alpha) B_1 + \alpha C_1$
    Set reward scheme to $(r - \rho_i k)$
    Solve by any RL method
    Determine from $v_i^*$ the coordinates of the new enclosing triangle
  until Zero-crossing termination or $H^{\pi}(s_I) = 0$


Consider the “$\alpha$-nudged” Algorithm 4. In it, in each gain-update step the nudging is set
a fraction $\alpha$ of the interval between its current bounds, for a fixed $\alpha$ throughout. An easy
upper bound on the reduction of the uncertainty in an $\alpha$-nudging step is to assume that
the largest among the whole interval between $B_1$ and $\frac{\rho_i}{2}$, that is, the whole set of possible
left uncertainty, or the whole interval between $\frac{\rho_i}{2}$ and $C_1$, that is, the complete space of
possible right uncertainty, will become the uncertainty of the next step.

Thus, between steps, the uncertainty would reduce by a factor of $\bar{\alpha} = \max[\alpha, 1 - \alpha]$.
Since the initial uncertainty has length $D$, it is easy to see that to bound the uncertainty
in an interval of length at most $\varepsilon$ requires a minimum of

$$n \ge \frac{-1}{\log \bar{\alpha}} \log\left(\frac{D}{\varepsilon}\right)$$

calls to reinforcement learning. Consequently, for any $\alpha$, $\alpha$-nudging has logarithmic complexity,
requiring the solution of $O\left(\log\left(\frac{D}{\varepsilon}\right)\right)$ cumulative MDPs. Furthermore the constant
term $\frac{-1}{\log \bar{\alpha}}$ is smallest when $\alpha = \bar{\alpha} = 0.5$.
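The bound is simple to evaluate. The following Python sketch is illustrative only; the function name is an assumption, and the per-step shrink factor is taken to be $\max[\alpha, 1 - \alpha]$ as in the argument above.

```python
import math


def calls_bound(D, eps, alpha=0.5):
    """Smallest integer n with D * max(alpha, 1 - alpha) ** n <= eps."""
    shrink = max(alpha, 1.0 - alpha)
    return max(0, math.ceil(math.log(D / eps) / -math.log(shrink)))


n_half = calls_bound(1.0, 1e-3, alpha=0.5)
n_skewed = calls_bound(1.0, 1e-3, alpha=0.9)
```

For $D = 1$ and $\varepsilon = 10^{-3}$, the midpoint choice needs ten calls in the worst case, and any other fixed $\alpha$ needs at least as many, since the bound depends on $\alpha$ only through $\max[\alpha, 1 - \alpha]$.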

Therefore, setting $\alpha = 0.5$ ensures that the uncertainty range will reduce at least in half
between iterations. This is obviously a first bound on the complexity of optimal nudging:
since Algorithm 3 is for practical purposes adaptively adjusting $\alpha$ between iterations and is
designed to minimize uncertainty, Algorithm 4 with $\alpha = 0.5$ can never outperform it. The
remaining question is whether the logarithmic bound is tight, that is, if for some enclosing
triangle, the gain update is the midpoint of the uncertainty range and the best possible
reduction of uncertainty is in half.

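This turns out to be almost the case. Consider the enclosing triangle with vertices
$A = (0, 0)$, $B = (l_B, l_B)$, and $C = (w_C, 0)$, for a small value of $w_C$. Further, suppose
that the gain-update step of the optimal nudging Algorithm 3 finds a $\rho_i$ that can also be
expressed as an adaptive $\alpha$ of the form $\frac{\rho_i}{2} = (1 - \alpha_i) B_1 + \alpha_i C_1$. Through direct substitution,
the following values in the expressions for left and right uncertainty can be readily found:

$$\begin{aligned}
m_\gamma &= 0 , \\
m_\beta &= \frac{l_B}{l_B - w_C} , \\
B_1 &= 0 , \\
C_1 &= \frac{w_C}{2} , \\
B_\gamma &= -l_B , \\
s &= 1 , \\
a &= \frac{-w_C}{l_B - w_C} , \\
b &= \frac{2 l_B - w_C}{l_B - w_C} , \\
c &= d = 1 , \\
e &= \frac{-2 l_B}{l_B - w_C} , \\
C_\beta &= \frac{l_B w_C}{2 l_B - w_C} , \\
C_\gamma &= 0 .
\end{aligned}$$

Making the right hand sides of the expressions for left, Equation (20), and right uncertainty,
Equation (22), equal, in order to find the minmax gain update, then substituting and dividing
both sides by the common factor $w_C$, we have

$$\frac{\left(C_\gamma - B_\gamma\right)\left(\frac{\rho_i}{2} - B_1\right)}{\frac{\rho_i}{2} - B_\gamma} = \frac{2s\sqrt{ab\left(\frac{\rho_i}{2} - C_\beta\right)\left(\frac{\rho_i}{2} - C_\gamma\right)} + a\left(\frac{\rho_i}{2} - C_\gamma\right) + b\left(\frac{\rho_i}{2} - C_\beta\right)}{e} ,$$

$$\frac{\alpha_i l_B}{\alpha_i w_C + 2 l_B} = \frac{l_B (1 - \alpha_i) + \alpha_i w_C - \sqrt{\alpha_i w_C \left(2 l_B (1 - \alpha_i) + \alpha_i w_C\right)}}{2 l_B} .$$

Making $w_C \to 0$, to find $\alpha_i$,

$$\frac{\alpha_i l_B}{2 l_B} = \frac{(1 - \alpha_i) l_B + 0 + 0}{2 l_B} ,$$

$$\alpha_i = 1 - \alpha_i ,$$

$$\alpha_i = \frac{1}{2} .$$

Thus, for the iteration corresponding to this enclosing triangle as $w_C$ tends to zero, optimal
nudging in fact approaches $\alpha$-nudging with $\alpha = 0.5$. However, the maximum possible
resulting uncertainty reduction for this case is slightly but strictly smaller than half.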


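$$u_r^* = u_l^* = \frac{\alpha_i w_C l_B}{\alpha_i w_C + 2 l_B} = \frac{\alpha_i w_C}{\alpha_i \frac{w_C}{l_B} + 2} = \frac{w_C}{4 + \frac{w_C}{l_B}} < \frac{w_C}{4} ,$$

where the last equality uses $\alpha_i = \frac{1}{2}$, and $\frac{w_C}{4}$ is half of the initial uncertainty $C_1 - B_1 = \frac{w_C}{2}$.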
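The limiting behaviour can be observed numerically. The Python sketch below is an illustration under assumptions: the function names are hypothetical, and the two uncertainty expressions are the ones obtained by substitution for the triangle $A = (0, 0)$, $B = (l_B, l_B)$, $C = (w_C, 0)$, with the square root entering the right uncertainty with a negative sign. A bisection on the adaptive fraction $\alpha_i$ equalizes left and right uncertainty for decreasing values of $w_C$.

```python
import math


def example_uncertainties(alpha, l_B, w_C):
    """Left and right uncertainty for A = (0, 0), B = (l_B, l_B), C = (w_C, 0)."""
    left = alpha * w_C * l_B / (alpha * w_C + 2.0 * l_B)
    root = math.sqrt(alpha * w_C * (2.0 * l_B * (1.0 - alpha) + alpha * w_C))
    right = w_C * (l_B * (1.0 - alpha) + alpha * w_C - root) / (2.0 * l_B)
    return left, right


def minmax_alpha(l_B, w_C, iterations=100):
    """Bisection on alpha in [0, 1] for equal left and right uncertainty."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        left, right = example_uncertainties(mid, l_B, w_C)
        if left < right:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Adaptive fractions for l_B = 1 and progressively smaller w_C.
alphas = [minmax_alpha(1.0, w_C) for w_C in (0.5, 0.1, 0.001)]
```

The resulting fractions increase towards 0.5 from below as $w_C$ shrinks, and the equalized uncertainty stays under $\frac{w_C}{4}$, consistent with the bound being approached but not attained.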


Hence, the same bound applies for $\alpha$ and optimal nudging, and the number of calls
required to make the uncertainty interval smaller than $\varepsilon$ is

$$n = O\left(\log \frac{D}{\varepsilon}\right) .$$

A number of further qualifications to this bound are required, however. Observe that,
at any point during the run of optimal nudging, after setting the gain $\rho_i$ to some fixed value
and solving the resulting cumulative-reward task, either of the two termination conditions can
be met, indicating that the global SMDP has been solved. The bound only considers how
big the new uncertainty range can get in the worst case and for the worst possible enclosing
triangle. Thus, much faster operation than suggested by the “reduction-of-uncertainty-in-half”
bound can be expected.

To illustrate this, we sampled five million valid enclosing triangles in the $w-l$ space for
$D = 1$ and studied how much optimizing $\rho_i$ effectively reduces the uncertainty range. For
the sampling procedure, to obtain an enclosing triangle, we first generated a uniformly
sampled value for $l_B + w_B$ and thereon, always uniformly, in order, $l_B - w_B$, $l_A + w_A$,
$l_C + w_C$, and $l_C - w_C$. From these values we obtained the $A$, $B$, and $C$ coordinates of the
triangle vertices and, from them, through Equations (20) and (22), the location of the gain
that solves the minmax problem and the corresponding maximum new uncertainty.

Figure 11 shows an approximation to the level sets of the distribution of the relative
uncertainty reduction and its marginal expectation over the length of the initial uncertainty.
After the bound computed above, one can expect that for some triangles with a small initial
uncertainty, namely those close to the origin, the uncertainty will reduce in just about half,
as well as for all triangles to have reductions below the 0.5 line. This is indeed the case,
but, for the sample, the expected reduction is much larger, to, on average, 20% or
less of the original uncertainty. Thus, although obviously the actual number of calls to
the reinforcement learning algorithm inside the loop in optimal nudging is task-dependent,
gain-uncertainty can be expected to reduce in considerably more than half after each call.

Figure 11: Distribution and expectation of relative uncertainty reduction as a function of
initial uncertainty, for a sample of five million artificial enclosing triangles.

Moreover, nothing bars the use of transfer learning between iterations. That is, the
optimal policy found in an iteration for a set $\rho_i$, and its estimated value, can be used as
the starting point of the learning stage after setting $\rho_{i+1}$, and if the optimal policies, or the
two gains, are close, one should expect the learning in the second iteration to converge much
faster than having started from scratch. As a consequence of this, not only are few calls to
the “black box” learning algorithm required (typically even fewer than predicted by the
logarithmic bound), but also as the iterations progress these calls can yield their solutions
faster.

Although transfer in reinforcement learning is a very active research field, to our knowledge
the problem of transfer between tasks whose only difference lies in the reward function
has not been explicitly explored. We suggest that this kind of transfer can be convenient
in practical implementations of optimal nudging, but leave its theoretical study as an open
question.

6. Experiments 

In this section we present a set of experimental results after applying optimal nudging to
some sample tasks. The goal of these experiments is both to compare the performance of
the methods introduced in this paper with algorithms from the literature, and to
study certain features of the optimal nudging algorithm itself, such as its complexity and
convergence, its sensitivity to the unichain condition, and the effect of transfer learning
between iterations.


6.1 Access Control Queuing Revisited 


Recall the motivating example task from Section 3.4. Figure shows that a simple implementation
of optimal nudging is approximately as good as the best finely tuned version of
R-learning for that task, while, as discussed, having fewer parameters and updates per step,
as well as room for speed improvement on other fronts.

In this section we return to that task to explore the effect of the $D$ parameter. From
Definition remember that $D$ is a bound on unsigned, unnudged reward, and thus a
possibly loose bound on gain. Since all the rewards in this task are non-negative, in the
optimal nudging results presented in Section 3.4, $D$ could be approximated by setting the
gain to zero and finding the policy of maximum expected reward before returning to the
recurrent state with all servers occupied. The first half million samples of the run were used
for this purpose. The exact value of $D$ that can be found in this task by this method (using
dynamic programming, for example) is the tightest possible, that is, that value would be
$D = \max_{\pi \in \Pi} v^{\pi}$ in Definition

More generally, for tasks with rewards of both signs, a looser approximation of $D$ can
also be computed by setting the gain to zero and solving for maximum expected gain from the
recurrent state with rewards $r = |r|$ for all possible transitions. Notice that this process of
$D$-approximation would add one call (that is, $O(1)$ calls) to the complexity bounds found
in the preceding Section, and would thus have a negligible effect on the complexity of the
algorithm.



Figure 12: Effect of varying $D$ on convergence. Left, to the optimal gain. Right, to the
termination condition.


For this experiment, let A be the value of the zero-gain cumulative-optimal policy from
the recurrent state, found exactly using dynamic programming (A = 151.7715). Figure 12
shows, in solid black lines, the performance of optimal nudging when D = A. As expected,
in a very small number of iterations (around four) the current approximation of the gain
(left plot) becomes nearly optimal and the nudged value of the recurrent state (right plot)
approaches zero, the termination condition.

The solid grey lines in both plots show the effect of overestimating D to ten and a hundred
times A. Since the number of calls to the black-box reinforcement learning algorithm
inside the optimal nudging loop grows with the logarithm of D, the number of iterations
required to achieve the same performance as above increases by about three for each tenfold
increase in D.

On the other hand, the dashed grey lines in the plots show the effect of underestimating
D. Since, for this case, A is the smallest valid value for D, setting the parameter below
it can cause the loss of theoretical guarantees for the method. However, for some values of
D < A the method still works, resulting in slightly faster convergence to the optimal gain
and the termination condition. This is the case as long as the mapped policy cloud remains
contained inside the triangle in the resulting $w-l$ space. For values of D < 0.6A, some of
the policies visited by the algorithm have a negative $w$ or $l$ and our implementation of the
method fails due to negative arguments to the root in the right uncertainty Equation (22).

Thus, although overestimating D naturally causes the algorithm to converge in a larger
number of iterations, this increase is buffered by the logarithmic complexity of optimal
nudging with respect to that parameter. Conversely, underestimating D below its tightest
bound can accelerate learning somewhat, but there is a cost in loss of theoretical guarantees
and the eventual failure of the method.


6.2 The Bertsekas Experimental Testbed 


For this second set of experiments, we consider a number of tasks from the paper by Bertsekas
(1998), which introduces the state-space splitting process that we use for optimal nudging.
That paper presents two versions of dynamic-programming stochastic-shortest-path
updates, SSP-Jacobi and SSP-Gauss-Seidel.

We consider the first two kinds of randomly generated tasks, called simply “Problems
of Table 1 [or 2]” in the paper; respectively “T1” and “T2” hereon. In both types, each
task has n states; the reward³ of taking each action (s, a) is randomly selected from the
range (0, n) according to a uniform distribution; for each pair (s, a), the states s' for which
the transition probability is non-zero are selected according to some rule; and all non-zero
transition probabilities are drawn from the uniform distribution in (0, 1) and normalized.
In all cases, to ensure compliance with the unichain condition, we set the n-th state as
recurrent for the Bertsekas split and, before normalization, add small transition probabilities 
{p = 10“®) to it, from all states and actions. All the results discussed below are the average 
of five runs (instead of Bertsekas’s two runs) for each set up. 
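The generation procedure just described can be sketched as follows for the single-action (T1-style) case. This is only an illustration: the function name, the use of Python's `random` module and the default `p_small` are assumptions, and `p_small` in particular is a placeholder rather than the value used in the experiments.

```python
import random

def random_t1_task(n, q, p_small=1e-6, seed=None):
    """Random single-action task: rewards uniform in (0, n); each transition
    is non-zero with probability q; rows are normalized after adding a small
    probability of jumping to the n-th (recurrent) state."""
    rng = random.Random(seed)
    rewards = [rng.uniform(0, n) for _ in range(n)]
    P = []
    for _ in range(n):
        row = [rng.random() if rng.random() < q else 0.0 for _ in range(n)]
        row[n - 1] += p_small          # hypothetical value, added before normalization
        total = sum(row)
        P.append([p / total for p in row])
    return rewards, P

rewards, P = random_t1_task(n=10, q=0.1, seed=0)
```

Because every row receives some probability of reaching the last state, that state is reachable from everywhere, which is the property the unichain requirement needs.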

The results listed below only count the number of sweeps of the methods. A sweep is 
simply a pass updating the value of all states. The SSP methods perform a continuous run 
of sweeps until termination, while each optimal nudging iteration is itself comprised of a 
number of sweeps, which are then added to determine the total number for the algorithm. 


3. In the context of the original paper, taking actions incurs a positive cost and the goal is to minimize
average cost. For consistency with our derivation, without altering algorithmic performance, we instead
maximize positive reward.
                  Jacobi               Gauss-Seidel
 n     q     ON/SSP   ONTS/SSP     ON/SSP   ONTS/SSP
10    0.05     3.09     0.79         4.11     0.94
10    0.1      3.30     0.49         3.86     0.61
10    0.5     20.94     2.56         8.92     1.20
20    0.05     1.66     0.34         2.02     0.43
20    0.1     17.45     2.60         5.46     0.85
20    0.5     27.54     3.67        18.32     2.70
30    0.05     3.34     0.53         2.29     0.39
30    0.1      5.16     0.75         5.91     0.73
30    0.5     27.62     4.11        12.82     2.04
40    0.05     3.92     0.62         3.96     0.61
40    0.1     19.74     3.01         7.81     1.27
40    0.5     29.00     4.47        15.57     2.55
50    0.05     7.37     1.16         3.43     0.58
50    0.1     38.35     6.20        19.11     3.26
50    0.5     32.65     5.20        17.53     2.97

Table 2: T1 tasks. Comparison of the ratio of sweeps of optimal nudging (ON), and optimal
nudging with transfer and zero-crossing checks (ONTS) over SSP methods.
Averages over five runs.


It is worth noting that for each sample, that is, for the update of the value of each state, 
while the SSP methods also update two bounds for the gain and the gain itself, optimal 
nudging doesn’t perform any additional updates, which means that the nudged iterations 
are, from the outset, considerably faster in all cases. 

The T1 tasks have only one action available per state, so obviously there is only one 
policy per task and the problem is policy evaluation rather than improvement. As in the 
source paper, we consider the cases with n between 10 and 50. The transition probability 
matrix is sparse and unstructured. Each transition is non-zero with probability q and we 
evaluate the cases with q ∈ {0.5, 0.1, 0.05}. As Bertsekas notes, there is a large variance
in the number of iterations of either method in different tasks generated from the same 
parameters, but their relative proportions are fairly consistent. 

The implementations of optimal nudging for the T1 tasks mirror the Jacobi and Gauss- 
Seidel updates from the original paper for the reinforcement learning step, and for both 
of them include two cases: raw learning in which after each gain update the values of all 
states are reset to zero and the change-of-sign termination condition is ignored, and learning 
with transfer of the latest values between gain updates and termination by zero crossing. 
Preliminary experiments showed that for this last configuration, any significant reduction 
in the number of sweeps comes from the termination condition. 
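The difference between the two update styles inside one nudging iteration can be sketched as below. This is a minimal illustration, not the implementation used in the experiments: it assumes one action per state, unit cost per transition (so the nudged reward is r − ρ), and that transitions into the recurrent state end the episode; the two-state chain is a made-up example.

```python
def sweep(values, P, rewards, rho, recurrent, gauss_seidel):
    """One sweep of nudged policy evaluation for a fixed gain rho."""
    # Jacobi reads a frozen copy; Gauss-Seidel reads values updated in this sweep.
    source = values if gauss_seidel else list(values)
    for s in range(len(values)):
        total = rewards[s] - rho
        for s2, p in enumerate(P[s]):
            if s2 != recurrent:        # entering the recurrent state terminates
                total += p * source[s2]
        values[s] = total
    return values

P = [[0.0, 1.0], [1.0, 0.0]]           # state 0 -> state 1 -> state 0
rewards = [1.0, 3.0]                   # average reward per step is 2
jacobi = sweep([0.0, 0.0], P, rewards, rho=2.0, recurrent=1, gauss_seidel=False)
gs = sweep([0.0, 0.0], P, rewards, rho=2.0, recurrent=1, gauss_seidel=True)
```

With the gain fixed at the true average reward, the nudged value of the recurrent state (index 1) is zero; the Gauss-Seidel sweep reaches it in one pass here, while the Jacobi sweep needs a second one.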

Table 2 summarizes the ratio of the number of sweeps required by optimal nudging over
those required by the SSP algorithms. The effect of considering the termination by zero-crossing
is striking. Whereas the raw version of optimal nudging can take on average up to
40 times as many sweeps altogether to reach similar results to SSP, including the change-of-sign
condition never yields a ratio higher than 10. Considering that inside the sweeps each
nudged update is faster, this means that in most cases for this task optimal nudging has a
comparable performance to the SSP methods. Moreover, in many cases, notably in the
sparser (most difficult) ones with the smallest q, optimal nudging requires on average
fewer sweeps than the other algorithms.

The T2 tasks also have only one action available from each state, for n between 10 and
50, but the transitions are much more structured; the only non-zero transition probabilities
from a state i ∈ {1, ..., n} are to states i − 1, i, and i + 1 (with the obvious exceptions for
states 1 and n).

Once more for this task we compare the performance of the raw and change-of-sign
condition versions of optimal nudging with the SSP methods for both Jacobi and Gauss-Seidel
updates.


            Jacobi               Gauss-Seidel
 n     ON/SSP   ONTS/SSP     ON/SSP   ONTS/SSP
10       8.78     1.55         7.86     1.45
20       9.04     1.79         8.23     1.69
30       5.10     0.98         4.68     0.94
40       3.38     0.68         3.27     0.69
50       3.56     0.79         3.42     0.78


Table 3: T2 tasks. Comparison of the ratio of sweeps of optimal nudging (ON), and optimal
nudging with transfer and zero-crossing checks (ONTS) over SSP methods.
Averages over five runs.


Table 3 summarizes, for these tasks, the ratio of the number of sweeps required by optimal
nudging over those required by the SSP algorithms. In this case, even for the raw version
of optimal nudging, it never takes over 10 times as many sweeps as SSP and, notably, the
ratio decreases consistently as the size of the problem grows. Once the zero-crossing condition
is included, both optimal nudging versions become much faster, requiring fewer sweeps than
the SSP methods for the largest tasks.


6.3 Discrete Tracking 


The final experiment compares the performance of optimal nudging and R-learning on a
problem with more complex dynamics and larger action (and policy) space. This task is
a discretization of the “Tracking” experiment discussed in the paper by Van Hasselt and
Wiering (2007).

In discrete tracking, inside a 10 x 10 grid with a central obstacle, as pictured in Figure
13, a moving target follows a circular path, traversing anticlockwise the eight cells shaded
in light grey. The goal of the agent is, at each step, to minimize its distance to the target.
Figure 13: Environment of the tracking task. The target moves anticlockwise from one light
grey cell to another. The agent (dark grey) can move up to four cells in each 
direction, without crossing the black obstacle or exiting the grid. The goal is for 
the agent to follow the target as closely as possible, moving as little as possible. 


While the target can cross the obstacle, and indeed one of its positions is on it, the agent
can’t pass over it and must learn to go around it.

At each decision epoch, the agent’s actions are moving to any cell in a 9 x 9 square 
centred on it, so it can move at most four cells in each direction. This is sufficient to 
manoeuvre around the obstacle and for the agent to be able to keep up with the target. In 
case of collision with the obstacle or the edges of the grid, the agent moves to the valid cell 
closest to the exit or collision point. 

After taking an action, moving the agent from state $s(s_x, s_y)$ to $s'(s'_x, s'_y)$, and the target
to $t'(t'_x, t'_y)$, the reward is

\[ r = \frac{1}{1 + (s'_x - t'_x)^2 + (s'_y - t'_y)^2} . \]

Thus, if the agent is able to move to the exact cell that the target will occupy, the reinforcement
is 1, and it will decay quickly the further agent and target are apart.

On the other hand, the cost of each action grows with the distance moved,

\[ C = 1 + |s_x - s'_x| + |s_y - s'_y| . \]

Notice that our assumption that all costs are larger than or equal to one holds for this 
model, and that the policy that minimizes action costs must require the agent to stand still 
at each state. 
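A minimal sketch of these two quantities follows. It assumes the reward is the reciprocal of one plus the squared distance between the agent's new cell and the target's new cell, and that the cost is one plus the Manhattan distance moved; the function names and the tuple representation of cells are illustrative choices only.

```python
def tracking_reward(agent_next, target_next):
    """Reward after the move: 1 when agent and target share a cell,
    decaying with the squared distance between them (assumed form)."""
    dx = agent_next[0] - target_next[0]
    dy = agent_next[1] - target_next[1]
    return 1.0 / (1.0 + dx * dx + dy * dy)

def action_cost(agent, agent_next):
    """Cost of the move: always at least 1, growing with the distance moved."""
    return 1 + abs(agent[0] - agent_next[0]) + abs(agent[1] - agent_next[1])
```

Standing still costs exactly 1, which is why the cost-minimizing policy keeps the agent in place, while the largest allowed move (four cells in each direction) costs 9.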

As described, this task isn’t unichain and it doesn’t have a recurrent state. Indeed, the 
policy of staying in place has one recurrent set for each state in the task, so the unichain 
condition doesn’t hold, and it is straightforward to design policies that bring and keep the 
agent in different cells, so their recurrent sets have no (recurrent) states in common. 
To overcome this, we arbitrarily set as recurrent the state with both the target and the
agent in the cell above and to the right of the bottom-left corner of the grid. From any state
with the target in its position just before that (light grey cell near the middle left in Figure
13), the agent moves to the recurrent state with probability 1 for any action. Observe
that making the transition probability to this recurrent state smaller than one would only
increase the value of D, but would not change the optimal policies, or their gain/value.


We average 10 runs of two algorithms to solve this task: optimal nudging and two
different R-learning set-ups. In all cases, ε-greedy action selection is used, with ε = 0.5.
In order to sample the state space more uniformly, every 10 moves the state is reset to a
randomly selected one.


For optimal nudging, we use Q-learning with a learning rate a = 0.5 and compute
the new gain every 250000 samples. Q is initialized to zero for all state-action pairs. No
transfer of Q is made between nudging iterations. Although an upper bound on D is
readily available (D < 8, which would be the value of visiting all the same cells as the
target, including the one inside the obstacle), we still use the first batch of samples to
approximate it. Additionally, although we keep track of the zero-crossing condition, we opt
not to terminate the computation when it holds, and allow the algorithm to observe 1.5
million samples.
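The Q-learning step inside one nudging iteration can be sketched as follows. The sketch assumes the nudged reward takes the form r − ρk used in the appendix (reward minus gain times action cost), that the gain stays fixed during the whole batch, and that transitions into the recurrent state are treated as terminal. The function name, the dictionary-based table and the `terminal` argument are hypothetical choices for this example.

```python
def nudged_q_update(Q, s, a, reward, cost, s_next, actions, rho,
                    alpha=0.5, terminal=None):
    """One Q-learning update on the nudged reward (reward - rho * cost)."""
    target = reward - rho * cost
    if s_next != terminal:
        # Undiscounted bootstrap from the best action in the next state.
        target += max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = (1.0 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q[(s, a)]

Q = {}
updated = nudged_q_update(Q, s=0, a="stay", reward=1.0, cost=2.0, s_next=1,
                          actions=["stay"], rho=0.25, alpha=0.5, terminal=1)
```

Note that no gain update appears in this step; the gain `rho` only changes between batches, which is the distinguishing feature of the method.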

For R-learning, we set the learning rate /3 = 10“^, and a = 0.5 (“R-learning 1”) and 
a = 0.01 (“R-learning 2”). The gain is initialized to zero and all other parameters are
inherited from the optimal nudging set-up.


[Plot: approximation to the optimal gain against the number of samples (x 1000, from 0 to 1500) for Nudged Q, R-learning 1, and R-learning 2.]


Figure 14: Approximation to the optimal gain. The vertical grey line indicates that the 
zero-crossing termination condition has already been met. 


Figure 14 shows the performance of the algorithms, averaged over the 10 runs, to approximate
the optimal gain. It is evident that after one million samples optimal nudging
converges to a value closer to ρ* than either R-learning set-up, but in fact the plot describes
the situation only partially and our algorithm performs even better than initially apparent.
For starters, recall that optimal nudging has one update less per sample, that of the gain,
so there is a significant reduction in the computation time even when the number of samples
is the same. Moreover, in six out of the ten optimal nudging runs ρ* was actually found
exactly, up to machine precision. This is something that R-learning couldn’t accomplish
in any of the experimental runs. On the contrary, while optimal nudging always found the
gain-optimal policy, in about half of the cases R-learning (both set-ups) found a policy that
differed from the optimal in at least one state.

Additionally, the grey vertical line indicates the point at which the zero-crossing termination
condition had been met in all optimal nudging runs, so in fact, for this particular
experiment, optimal nudging requires only two thirds of the samples to find an approximation
to the optimal gain with only about one third of the error of R-learning.


7. Conclusions and Future Work 


We have presented a novel semi-Markov and average-reward reinforcement learning method 
that, while belonging to the generic class of algorithms described by Algorithm differs in 
one crucial aspect, the frequency and mode of gain updates. 

While all methods described in the preceding literature update the gain after each sample,
that is, after the update of the gain-adjusted value of each state for dynamic programming
methods or after taking each action (or greedy action) in model-based or model-free
methods, the optimal nudging algorithm consists of a series of cumulative tasks, for fixed
gains that are only updated once the current task is considered solved.

This delaying of the gain updates is feasible because, once the nudged value for a fixed 
gain is known, it is possible and relatively simple to ensure in each update that the maximum 
possible new gain uncertainty will be minimized. We have introduced the straightforward 
transformation of the policies into the w — I space and, through geometric analysis therein, 
shown how to select the new value of the gain, to fix for the next iteration, such that this 
minmax result is guaranteed. 

The disentangling of value and gain updates has the further advantage of allowing the
use of any cumulative-reward reinforcement learning method, either exploiting its robustness
and speed of approximation or inheriting its theoretical convergence properties and
guarantees. Regarding the complexity of optimal nudging itself, we have shown that the
number of calls to the learning algorithm is at most logarithmic in the desired precision
and the D parameter, which is an upper bound on the gain.

Also, we have proved an additional condition for early termination, when between optimal
nudging iterations the same policy optimizes nudged value, and the optimal value for
a reference state switches sign. This condition is unique to optimal nudging, since in any
other method that continuously updates the gain, many sign changes can be expected for
the reference state throughout, and none of them can conclusively guarantee termination.
Moreover, our experiments with the set of random tasks from Bertsekas (1998) show that
this condition can be very effective in reducing the number of samples required to learn in
practical cases.

Additionally, compared with traditional algorithms, while maintaining a competitive
performance and sometimes outperforming them for tasks of increasing complexity,
optimal nudging has the advantage of requiring at least one fewer parameter, the
learning rate of the gain updates, as well as between one and three fewer updates per sample.
This last improvement can represent a significant reduction in computation time, in those
cases in which samples are already stored or can be observed quickly (compared to the time
required to perform the updates).

Finally, a number of lines of future work are open for the study or improvement of the
optimal nudging algorithm. First, Figure suggests that each call to the “black-box” reinforcement
learning method should have its own termination condition, which could probably
be set adaptively to depend on the nudged-optimal value of the recurrent state, since in
the early learning iterations the nudged-optimal policies likely require far less precision to
terminate without affecting performance.

Likewise, as mentioned above, transfer learning, in this case keeping the state values after
updating the gain, can lead to faster termination of the reinforcement learning algorithm,
especially towards the end, when the differences between nudged-optimal policies tend to
shrink. Although our observation in the Bertsekas testbed is that transfer doesn’t have the
same effect as the zero-crossing stopping condition, we suggest studying how it alters
learning speed, either on different tasks and for different algorithms or from a theoretical
perspective.

On a different topic, it would be constructive to explore to what extent the w — I 
transformation could be directly applied to solve average-reward and semi-Markov tasks 
with continuous state/action spaces. This is a kind of problem that has received, to our 
knowledge, little attention in the literature, and some preliminary experiments suggest that 
the extension of our results to that arena can be relatively straightforward. 

The final avenue for future work suggested is the extension of the complexity results of 
Section 5 to the case in which the black-box algorithm invoked inside the optimal nudging 
loop is PAC-MDP. Our preliminary analysis indicates that the number of calls to such 
an algorithm would be dependent on the term where a is the smallest a such that 
^ = (1 — a)Bi + aCi is a valid optimal nudging update. 

Some sampling (five million randomly-generated valid enclosing triangles, for D = 1) 
suggests that not only is a positive, but it is also not very small, equal to approximately 
0.11. However, since there is no analytic expression for pi, and thus neither for a, proving 
that this is indeed the case is by no means trivial, and the question of the exact type of 
dependence of the number of calls on the inverse of a remains open anyway. 
Appendix A.

In this Appendix we prove the remainder of the “reduction of enclosing triangles” Lemma
from Section 4.4.1.

Lemma Let the points $A = (w_A, l_A)$, $B = (w_B, l_B)$, and $C = (w_C, l_C)$ define an enclosing
triangle. Setting $w_B - l_B < \rho_i < w_C - l_C$ and solving the resulting task with rewards $(r - \rho_i k)$
to find $v^*$ results in a strictly smaller, possibly degenerate enclosing triangle.

Our in-line discussion and Figure show how to find A', B', and C' depending on the
sign and optimal value of the recurrent state for the fixed gain $\rho_i$. It remains to show that
the resulting triangle A'B'C' is strictly smaller than ABC as well as enclosing, that is, that
the conditions of the definition of an enclosing triangle hold. Those conditions are:


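1. $w_B \ge w_A$; $l_B \ge l_A$.

2. $\dfrac{l_B - l_A}{w_B - w_A} = 1$.

3. $w_A \ge l_A$; $w_B \ge l_B$; $w_C \ge l_C$.

4. $P = \dfrac{w_B - l_B}{2} = \dfrac{w_A - l_A}{2} \le \dfrac{w_C - l_C}{2} = Q$.

5. $0 \le \dfrac{l_C - l_A}{w_C - w_A} < 1$.

6. $\left| \dfrac{l_C - l_B}{w_C - w_B} \right| \ge 1$.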


For brevity, we will only consider one case, reproduced in Figure 15 below. The proof
procedure for the other two non-degenerate cases follows the same steps and adds no further
insight. (In light of Remark 12, the proof of the degenerate cases is trivial and shall be
omitted.)



Figure 15: (Detail from Figure]^ Outcome after solving a nudged task within the bounds 
of an arbitrary enclosing triangle. 

Proof For this proof we consider the case in which, after setting the gain to $\rho_i$ and solving
to find the nudged-optimal policy $\pi^*$ and its value $v^*$, the line with slope that crosses
the point $(\rho_i/2, -\rho_i/2)$ also intersects the segments AC (at the point C') and AB (at B'). We
want to show that the triangle with vertices A' = A, B' and C' is enclosing and strictly
smaller than ABC.

In order to study whether the conditions listed above hold for A'B'C', we will use
a simple affine transformation (rotation and scaling) from the w − l space to an auxiliary
space x − y in which all of A, B, P, and B' have the same horizontal-component (x) value,
the original P; and likewise C and Q have the same x value, equal to that of Q in w − l.

This transformation simplifies the analysis somewhat, particularly the expressions for
the coordinates of C', but it naturally has no effect on the validity of the proof. The
only difference with a proof without the mapping to x − y is the complexity of the
formulas, not the basic structure or the sequence of steps.




Figure 16: Affine transformation of the w − l space. Left, original system in w − l. Right,
rotated and scaled system in the x − y space. P and Q have the same value in
the horizontal axis of both planes.


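Figure 16 shows the geometry of the affine mapping from w − l to x − y. The transformation
matrix and equations are

\[ \begin{bmatrix} x \\ y \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} w \\ l \end{bmatrix} = \begin{bmatrix} \frac{w - l}{2} \\ \frac{w + l}{2} \end{bmatrix} , \]

while the inverse transformation is

\[ \begin{bmatrix} w \\ l \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} y + x \\ y - x \end{bmatrix} . \]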

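In both the w − l and x − y planes, we have $P = \frac{w_A - l_A}{2} = \frac{w_B - l_B}{2}$ and $Q = \frac{w_C - l_C}{2}$. The
mapped vertices of the original triangle have coordinates $A_{xy} = (x_A, y_A) = \left(P, \frac{w_A + l_A}{2}\right)$,
$B_{xy} = (x_B, y_B) = \left(P, \frac{w_B + l_B}{2}\right)$, and $C_{xy} = (x_C, y_C) = \left(Q, \frac{w_C + l_C}{2}\right)$.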



Since the gain is bounded by the projection of the enclosing triangle vertices, we can
express, in either plane, $\rho_i/2 = (1 - \alpha)P + \alpha Q$, for some $0 < \alpha < 1$. We omit the case with
$\alpha = 0$, since under it and a combination of several other conditions our complexity analysis
from Section 5 may not hold.

Regarding the coordinates of the new triangle, as mentioned A' = A, so $A'_{xy} = A_{xy}$, and
B' is a point on the AB segment, so we can express its mapping as $B'_{xy} = (1 - \beta)A_{xy} + \beta B_{xy}$,
for $0 < \beta < 1$. C', on the other hand, must be found analytically, as the intersection point of
the $A_{xy}C_{xy}$ segment and the segment from the point $(\rho_i/2, 0)$ through $B'_{xy}$. Observe that the
slope of the $A_{xy}C_{xy}$ segment is positive. Indeed, since condition 5 holds for ABC, we have


\[ \frac{l_C - l_A}{w_C - w_A} \ge 0 , \]
\[ \frac{y_C - x_C - y_A + x_A}{y_C + x_C - y_A - x_A} \ge 0 , \]
\[ m_\gamma = \frac{y_C - y_A}{x_C - x_A} \ge 1 . \]


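Solving the system of the two equations of the lines that contain the two segments, the
x − y coordinates of $C'_{xy}$ are readily found to be

\[ x_{C'} = \frac{\alpha (Q - P)(m_\gamma P - y_A) + (y_A + \beta (y_B - y_A))(P + \alpha (Q - P))}{\alpha m_\gamma (Q - P) + y_A + \beta (y_B - y_A)} \tag{24} \]

and

\[ y_{C'} = \frac{(y_A + \beta (y_B - y_A))(\alpha m_\gamma (Q - P) + y_A)}{\alpha m_\gamma (Q - P) + y_A + \beta (y_B - y_A)} . \tag{25} \]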
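As a sanity check on these closed forms, the sketch below evaluates them for arbitrary illustrative numbers and verifies that the resulting point lies on both lines. It assumes that, in the x − y plane, the AC line passes through (P, y_A) with slope m, and that the second line joins the gain point (P + α(Q − P), 0) with B' = (P, y_A + β(y_B − y_A)); the function name and the numeric values are made up for the example.

```python
def c_prime(P, Q, yA, yB, m, alpha, beta):
    """Coordinates of C' in the x-y plane from the closed-form expressions."""
    yBp = yA + beta * (yB - yA)                 # height of B'
    den = alpha * m * (Q - P) + yBp
    x = (alpha * (Q - P) * (m * P - yA) + yBp * (P + alpha * (Q - P))) / den
    y = yBp * (alpha * m * (Q - P) + yA) / den
    return x, y

P, Q, yA, yB, m, alpha, beta = 1.0, 3.0, 2.0, 4.0, 1.5, 0.5, 0.5
xC, yC = c_prime(P, Q, yA, yB, m, alpha, beta)

g = P + alpha * (Q - P)                          # x-coordinate of the gain point
on_ac = yA + m * (xC - P)                        # height of line AC at xC
on_gb = (yA + beta * (yB - yA)) * (g - xC) / (g - P)  # height of line G-B' at xC
```

For these values the intersection falls strictly between P and the gain point, as the proof requires.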

We are ready to begin evaluating for A'B'C' the conditions in the definition of an 
enclosing triangle, one by one. From the premise of the Lemma, since ABC is an enclosing 
triangle, all conditions hold for it. 

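1.
\begin{align*}
w_{B'} &\ge w_{A'} , \\
x_{B'} + y_{B'} &\ge x_A + y_A , \\
P + (1 - \beta) y_A + \beta y_B &\ge P + y_A , \\
\beta (y_B - y_A) &\ge 0 ,
\end{align*}
since $\beta > 0$,
\begin{align*}
y_B - y_A &\ge 0 , \\
\frac{w_B + l_B}{2} - \frac{w_A + l_A}{2} &\ge 0 , \\
(w_B - w_A) + (l_B - l_A) &\ge 0 ,
\end{align*}
which holds because both terms are positive by this same condition for ABC. Conversely,
\begin{align*}
l_{B'} &\ge l_{A'} , \\
y_{B'} - x_{B'} &\ge y_A - x_A , \\
(1 - \beta) y_A + \beta y_B - P &\ge y_A - P , \\
\beta (y_B - y_A) &\ge 0 ,
\end{align*}
and as above.

2.
\begin{align*}
\frac{l_{B'} - l_{A'}}{w_{B'} - w_{A'}} &= 1 , \\
y_{B'} - x_{B'} - y_A + x_A &= x_{B'} + y_{B'} - x_A - y_A , \\
2 x_{B'} &= 2 x_A , \\
P &= P ,
\end{align*}
which is true.

3.
\begin{align*}
w_{A'} &\ge l_{A'} , \\
x_A + y_A &\ge y_A - x_A , \\
x_A = P &\ge 0 , \\
\frac{w_A - l_A}{2} &\ge 0 , \\
w_A &\ge l_A ,
\end{align*}
which holds. The proof is identical for B', since $x_{B'}$ also equals P. For C',
\begin{align*}
w_{C'} &\ge l_{C'} , \\
x_{C'} &\ge 0 ,
\end{align*}
since all terms in the denominator in Equation (24) are positive,
\begin{align*}
\alpha (Q - P)(m_\gamma P - y_A) + (y_A + \beta (y_B - y_A))(P + \alpha (Q - P)) &\ge 0 , \\
\alpha m_\gamma P (Q - P) + y_A P + \beta (y_B - y_A)(P + \alpha (Q - P)) &\ge 0 ,
\end{align*}
which is easy to verify to hold since all factors in all three summands are non-negative.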
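4.
\begin{align*}
P &\le \frac{w_{C'} - l_{C'}}{2} , \\
P &\le x_{C'} .
\end{align*}
Since, again, the denominator in Equation (24) is positive,
\begin{align*}
\alpha (Q - P)(m_\gamma P - y_A) + (y_A + \beta (y_B - y_A))(P + \alpha (Q - P)) &\ge \alpha m_\gamma P (Q - P) + P (y_A + \beta (y_B - y_A)) , \\
\alpha \beta (Q - P)(y_B - y_A) &\ge 0 ,
\end{align*}
which holds because all factors are non-negative.

5.
\[ 0 \le \frac{l_{C'} - l_{A'}}{w_{C'} - w_{A'}} < 1 . \]
This condition holds because the AC segment contains A'C', so both have the same
slope.

6.
\[ \left| \frac{l_{B'} - l_{C'}}{w_{B'} - w_{C'}} \right| = \left| \frac{l_{B'} + \frac{\rho_i}{2}}{w_{B'} - \frac{\rho_i}{2}} \right| \ge 1 . \]
The numerator is positive. If the denominator is positive as well,
\begin{align*}
\frac{l_{B'} + \frac{\rho_i}{2}}{w_{B'} - \frac{\rho_i}{2}} &\ge 1 , \\
l_{B'} + \frac{\rho_i}{2} &\ge w_{B'} - \frac{\rho_i}{2} , \\
l_{B'} - w_{B'} + \rho_i &\ge 0 , \\
\frac{\rho_i}{2} &\ge \frac{w_{B'} - l_{B'}}{2} , \\
P + \alpha (Q - P) &\ge P , \\
\alpha (Q - P) &\ge 0 ,
\end{align*}
which holds because both factors are nonnegative. If the denominator is negative,
\begin{align*}
\frac{l_{B'} + \frac{\rho_i}{2}}{\frac{\rho_i}{2} - w_{B'}} &\ge 1 , \\
l_{B'} + \frac{\rho_i}{2} &\ge \frac{\rho_i}{2} - w_{B'} , \\
l_{B'} + w_{B'} &\ge 0 ,
\end{align*}
which was proved above. If the denominator is zero, the slope of the segment is infinity
(of either sign), for which the condition trivially holds.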
It remains to show that Q' < Q, and thus that the new enclosing triangle is strictly smaller
than the initial one. By direct derivation,

\begin{align*}
x_{C'} &< x_C , \\
\frac{\alpha (Q - P)(m_\gamma P - y_A) + (y_A + \beta (y_B - y_A))(P + \alpha (Q - P))}{\alpha m_\gamma (Q - P) + y_A + \beta (y_B - y_A)} &< Q , \\
\alpha m_\gamma (Q - P)^2 + y_A (Q - P) + \beta (Q - P)(y_B - y_A) - \alpha \beta (Q - P)(y_B - y_A) &> 0 .
\end{align*}
Since Q > P,

\[ \alpha m_\gamma (Q - P) + y_A + \beta (1 - \alpha)(y_B - y_A) > 0 . \]

Under our assumption that the ABC triangle is not degenerate, all factors in all summands
are nonnegative. Furthermore, since we rule out the possibility that $\alpha = 0$, the first term
is indeed positive and, thus, the inequality holds.
Appendix B.

In this Appendix we prove the left uncertainty Lemma 17 from Section 4.4.3.

Lemma For any enclosing triangle with vertices A, B, C,

1. For any $A_1 = B_1 \le \rho_x/2 \le C_1$, the maximum possible left uncertainty, $u_l^*$, occurs when
the line with slope from $(\rho_x/2, -\rho_x/2)$ intercepts AB and BC at point B.

2. When this is the case, the maximum left uncertainty is

\[ u_l^* = 2 \, \frac{\left(\frac{\rho_x}{2} - B_1\right)(C_\gamma - B_\gamma)}{\left(\frac{\rho_x}{2} - B_\gamma\right)} . \]

3. This expression is a monotonically increasing function of $\rho_x$.

4. The minimum value of this function is zero, and it is attained when $\rho_x/2 = B_1$.

Proof Figure 17 illustrates the geometry of left uncertainty.



Figure 17: Geometry and notation for the proof of the left uncertainty Lemma. If gain is
fixed to some $B_1 \le \rho_x/2 \le C_1$, depending on the optimal nudged value the point
S may either lie on the BC (pictured) or AB segments, the point T lies on the
AC segment (which has slope $m_\gamma$), and the left uncertainty is $u_l = 2(T_1 - S_1)$.


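Setting the gain to $\rho_x$, we are interested in computing the left uncertainty as a function
of the coordinates of S. The ST segment belongs to a line with expression

\[ \frac{l + \frac{\rho_x}{2}}{w - \frac{\rho_x}{2}} = \frac{l_S + \frac{\rho_x}{2}}{w_S - \frac{\rho_x}{2}} , \]
\[ l = \frac{\left(l_S + \frac{\rho_x}{2}\right) w - (w_S + l_S) \frac{\rho_x}{2}}{w_S - \frac{\rho_x}{2}} . \tag{26} \]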
Conversely, for points in AC, including T,

\[ l = m_\gamma w - (m_\gamma w_C - l_C) , \]
\[ l = m_\gamma w - (1 + m_\gamma) C_\gamma . \tag{27} \]

Thus, for the 1-projection of point T,

\[ T_1 = \frac{w_T - l_T}{2} = \frac{w_T - (m_\gamma w_T - (1 + m_\gamma) C_\gamma)}{2} = \frac{(1 - m_\gamma) w_T + (1 + m_\gamma) C_\gamma}{2} . \tag{28} \]


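Since point T lies at the intersection of segments AC and ST, its w-coordinate can be found
by making Equations (27) and (26) equal and solving:

\[ m_\gamma w_T - (1 + m_\gamma) C_\gamma = \frac{\left(l_S + \frac{\rho_x}{2}\right) w_T - (w_S + l_S) \frac{\rho_x}{2}}{w_S - \frac{\rho_x}{2}} , \]
\[ w_T \left( m_\gamma \left(w_S - \frac{\rho_x}{2}\right) - \left(l_S + \frac{\rho_x}{2}\right) \right) = (1 + m_\gamma) \left(w_S - \frac{\rho_x}{2}\right) C_\gamma - (w_S + l_S) \frac{\rho_x}{2} , \]
\[ w_T = \frac{(1 + m_\gamma) \left(w_S - \frac{\rho_x}{2}\right) C_\gamma - (w_S + l_S) \frac{\rho_x}{2}}{(1 + m_\gamma) \left(S_\gamma - \frac{\rho_x}{2}\right)} . \tag{29} \]

Substituting Equation (29) in (28),

\[ T_1 = \frac{(1 - m_\gamma^2) \left(w_S - \frac{\rho_x}{2}\right) C_\gamma - (1 - m_\gamma)(w_S + l_S) \frac{\rho_x}{2} + (1 + m_\gamma)^2 \left(S_\gamma - \frac{\rho_x}{2}\right) C_\gamma}{2 (1 + m_\gamma) \left(S_\gamma - \frac{\rho_x}{2}\right)} \]

and, finally, after solving, the left uncertainty is

\[ u_l = 2 (T_1 - S_1) = (w_T - l_T) - (w_S - l_S) , \]
\[ u_l = 2 \, \frac{\left(\frac{\rho_x}{2} - S_1\right)(C_\gamma - S_\gamma)}{\left(\frac{\rho_x}{2} - S_\gamma\right)} . \tag{30} \]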
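The closed form for the left uncertainty can be cross-checked against a direct geometric computation. The sketch below assumes the projections (w − l)/2 and (m w − l)/(1 + m) for a point (w, l), with m the slope of AC, and computes T as the intersection of the line through the gain point and S with the line through C of slope m. The function names and the numbers are arbitrary, chosen only to exercise the algebra rather than to represent a particular enclosing triangle.

```python
def gamma_projection(w, l, m):
    """Projection of (w, l) onto the w = -l line along slope m."""
    return (m * w - l) / (1.0 + m)

def left_uncertainty_formula(half_gain, S, C, m):
    """Closed form: 2 (g - S_1)(C_gamma - S_gamma) / (g - S_gamma)."""
    S1 = (S[0] - S[1]) / 2.0
    Sg = gamma_projection(S[0], S[1], m)
    Cg = gamma_projection(C[0], C[1], m)
    return 2.0 * (half_gain - S1) * (Cg - Sg) / (half_gain - Sg)

def left_uncertainty_geometric(half_gain, S, C, m):
    """Same quantity from the explicit intersection point T."""
    slope_st = (S[1] + half_gain) / (S[0] - half_gain)
    # AC: l = m (w - w_C) + l_C ;  ST: l = slope_st (w - g) - g
    wT = (slope_st * half_gain + half_gain - m * C[0] + C[1]) / (slope_st - m)
    lT = m * (wT - C[0]) + C[1]
    return (wT - lT) - (S[0] - S[1])

S, C, m, g = (2.0, 1.0), (4.0, 1.0), 0.5, 1.0
by_formula = left_uncertainty_formula(g, S, C, m)
by_geometry = left_uncertainty_geometric(g, S, C, m)
```

Both routes give the same value for these inputs, which supports the algebra leading to the closed form.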


Figure 18 expands Figure 17 to include the $m_\gamma$-projections of points S and C to the
w = −l line. Observe that the projections and gain are ordered in the sequence

\[ S_\gamma \le C_\gamma \le B_1 \le S_1 \le T_1 \le \frac{\rho_x}{2} \le C_1 . \tag{31} \]

The ordering of the terms between $B_1$ and $C_1$ results from setting $\rho_x$ and having left
uncertainty. $B_1 = S_1$ occurs when the point S lies on the AB segment (instead of on BC
as pictured). $S_1 = T_1$ is only possible if the gain is set to $\rho_x = 2 S_1$, in which case all four
points are concurrent.

To see that $C_\gamma < B_1$, observe that

\[ A_\gamma = C_\gamma < B_1 = A_1 , \]
\[ \frac{m_\gamma w_A - l_A}{m_\gamma + 1} < \frac{w_A - l_A}{2} . \]
Figure 18: Expansion of Figure 17 to present the geometry of all relevant terms in the
expression of left uncertainty for an arbitrary S, Equation (30).


Since $m_\gamma > 0$, we can cross-multiply the denominators to obtain

\[ m_\gamma (w_A + l_A) < w_A + l_A . \]

Given $w_A + l_A$ is positive (i.e., A belongs to the w − l space), this requires $m_\gamma < 1$, which
holds by definition of enclosing triangle (see condition 5 in its definition).

Finally, to ensure that $S_\gamma < C_\gamma$, the point B must not lie on the AC segment, but this
condition holds trivially for a non-degenerate enclosing triangle with vertices A, B, and C.
We are now ready to prove all items in the Lemma. 


1. To see that left uncertainty is maximum when S = B, there are two cases. If $S = (w_S, l_S) \in AB$,
consider another point, slightly above on that segment, with coordinates
$S' = (w_S + \varepsilon, l_S + \varepsilon)$, for some valid $\varepsilon > 0$. Obviously, $A_1 = B_1 = S_1 = S'_1 \le \rho_x/2$. Leaving
all other parameters equal, for the case with S' to have larger left uncertainty than S, from
Equation (30),

\[ 2 \, \frac{\left(\frac{\rho_x}{2} - S_1\right)(C_\gamma - S_\gamma)}{\left(\frac{\rho_x}{2} - S_\gamma\right)} \le 2 \, \frac{\left(\frac{\rho_x}{2} - S'_1\right)(C_\gamma - S'_\gamma)}{\left(\frac{\rho_x}{2} - S'_\gamma\right)} . \]

Since $S_1 = S'_1$ and, from Equation (31), both are smaller than $\rho_x/2$, the first term in the
numerator can be cancelled from both sides. Additionally, also from Equation (31), since
$\rho_x/2$ is larger than $S_\gamma$ (and $S'_\gamma$), we can cross-multiply the denominators, to obtain

\[ \frac{\rho_x}{2} (S_\gamma - S'_\gamma) \ge C_\gamma (S_\gamma - S'_\gamma) . \]

The common term is equivalent (by direct substitution) to

\[ S_\gamma - S'_\gamma = \varepsilon \, \frac{1 - m_\gamma}{1 + m_\gamma} . \]

Given that $m_\gamma < 1$ and $\varepsilon > 0$ this term is nonnegative, so the only requirement is for

\[ \frac{\rho_x}{2} \ge C_\gamma , \]

which holds after Equation (31). Thus, the position of S on AB for maximum left uncertainty
is the furthest possible up, that is, on point B.

Conversely, if $S = (w_S, l_S) \in BC$, consider another point in the segment, closer to B,
with coordinates $S' = (w_S + \varepsilon / m_\beta, l_S + \varepsilon)$, where $|m_\beta| > 1$ is the slope of the (line that
contains that) segment and, once more, $\varepsilon > 0$. For the case with S' to have larger left
uncertainty than S, the following must hold:

\[ 2 \, \frac{\left(\frac{\rho_x}{2} - S_1\right)(C_\gamma - S_\gamma)}{\left(\frac{\rho_x}{2} - S_\gamma\right)} \le 2 \, \frac{\left(\frac{\rho_x}{2} - S'_1\right)(C_\gamma - S'_\gamma)}{\left(\frac{\rho_x}{2} - S'_\gamma\right)} . \tag{32} \]

In this case, although the first terms in the numerator of both sides aren’t equal, we still
can omit them, since it is easy to show that

\[ \left(\frac{\rho_x}{2} - S_1\right) < \left(\frac{\rho_x}{2} - S'_1\right) , \]

which reduces to

\[ \rho_x - (w_S - l_S) < \rho_x - (w'_S - l'_S) , \]
\[ \varepsilon - \frac{\varepsilon}{m_\beta} > 0 , \tag{33} \]

which is true: if $m_\beta < -1$ both summands become positive and if $m_\beta > 1$ the second term
is strictly smaller than the first, so their difference is positive. ($m_\beta = 1$ corresponds to a
degenerate case that escapes the scope of this Lemma.)

Thus, Equation (32), as in the case above, simplifies to

\[ \frac{\rho_x}{2} (S_\gamma - S'_\gamma) \ge C_\gamma (S_\gamma - S'_\gamma) , \]

and in this case $S_\gamma - S'_\gamma > 0$ reduces itself to the form in Equation (33), which was proved
above. Hence, once more we require that $\rho_x/2 \ge C_\gamma$, which was proved already. Thus, for
points in the BC segment, left uncertainty is maximum when S = B.


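2. From the preceding discussion, substituting B for S in Equation (30),

\[ u_l^* = 2 \, \frac{\left(\frac{\rho_x}{2} - B_1\right)(C_\gamma - B_\gamma)}{\left(\frac{\rho_x}{2} - B_\gamma\right)} . \]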
3. To prove the increasing monotonicity of this expression with respect to $\rho_x$, we fix all
other parameters and study the effect of moving the gain from $\rho_x$ to $\rho_x + \varepsilon$, for some $\varepsilon > 0$.
If the function is monotonically increasing, the following must hold,

\[ 2\,\frac{\left(\tfrac{\rho_x}{2} - B_1\right)\left(C_\gamma - B_\gamma\right)}{\left(\tfrac{\rho_x}{2} - B_\gamma\right)} < 2\,\frac{\left(\tfrac{\rho_x + \varepsilon}{2} - B_1\right)\left(C_\gamma - B_\gamma\right)}{\left(\tfrac{\rho_x + \varepsilon}{2} - B_\gamma\right)} . \]

After cancelling the equal terms on both sides and cross-multiplying (the positivity of
the denominators was already shown above), the expression immediately reduces to

\[ \frac{\varepsilon}{2} B_\gamma < \frac{\varepsilon}{2} B_1 , \]

which trivially holds by the positivity of $\varepsilon$ and Equation (31).


4. Since $u_l^*$ increases monotonically with $\rho_x$, its minimum value must occur for the smallest
$\rho_x$, namely $\frac{\rho_x}{2} = B_1$, for which the first factor in the numerator, and hence the expression,
becomes zero.
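
The behaviour claimed in items 3 and 4 can be checked numerically. The following is only a sketch: the function names, the triangle vertices and the slope are hypothetical values chosen for illustration, and the projection formula is the one used for $S_1$ and $S_\gamma$ in Appendix C.

```python
def projection(point, slope):
    """w-coordinate where the line with the given slope through `point` meets w = -l."""
    w, l = point
    return (slope * w - l) / (slope + 1.0)


def max_left_uncertainty(half_gain, B, C, m_gamma):
    """Left uncertainty with S placed at vertex B; half_gain stands for rho_x / 2."""
    B_1 = projection(B, 1.0)
    B_gamma = projection(B, m_gamma)
    C_gamma = projection(C, m_gamma)
    return 2.0 * (half_gain - B_1) * (C_gamma - B_gamma) / (half_gain - B_gamma)


# Hypothetical vertices (an assumption, not taken from the paper):
# BC has slope -2 and AC has slope 0.
B, C, m_gamma = (5.0 / 3.0, 5.0 / 3.0), (2.0, 1.0), 0.0
B_1, C_1 = projection(B, 1.0), projection(C, 1.0)

# Sweep rho_x / 2 from B_1 to C_1 in ten equal steps.
half_gains = [B_1 + k * (C_1 - B_1) / 10.0 for k in range(11)]
values = [max_left_uncertainty(x, B, C, m_gamma) for x in half_gains]
print(values)
```

For this particular triangle the printed values start at zero (when $\frac{\rho_x}{2} = B_1$) and grow with the gain, as the Lemma states.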
Appendix C.


In this Appendix we prove the right uncertainty Lemma 18 from Section 4.4.3.

Lemma For a given enclosing triangle with vertices A, B, C,

1. for any $A_1 = B_1 < \frac{\rho_x}{2} < C_1$, the maximum possible right uncertainty, $u_r^*$, is

\[ u_r^* = \frac{2s\sqrt{abcd\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)} + ad\left(\tfrac{\rho_x}{2} - C_\gamma\right) + bc\left(\tfrac{\rho_x}{2} - C_\beta\right)}{e} , \tag{34} \]

with

$s = \operatorname{sign}(m_\beta - m_\gamma)$ ,
$a = (1 - m_\beta)$ ,
$b = (1 + m_\beta)$ ,
$c = (1 - m_\gamma)$ ,
$d = (1 + m_\gamma)$ ,
$e = (d - b) = (m_\gamma - m_\beta)$ .

2. This is a monotonically decreasing function of $\rho_x$,

3. whose minimum value is zero, attained when $\frac{\rho_x}{2} = C_1$.

Proof Figure 19 illustrates the geometry of right uncertainty.



Figure 19: Geometry and notation for the proof of the right uncertainty Lemma. If gain
is fixed to some $B_1 < \frac{\rho_x}{2} < C_1$, the point S lies on the BC segment (which
has slope $m_\beta$), T lies on the AC segment (which has slope $m_\gamma$), and the right
uncertainty is $u_r = 2(S_1 - T_1)$.

In this case, we have the point $S \in BC$ and $T = \overline{\tfrac{\rho_x}{2} S} \cap AC$. The resulting new right
uncertainty is $u_r = 2(S_1 - T_1)$. Following the same derivation as Equation (30) in Appendix
B,

\[ u_r = 2\,\frac{\left(S_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - S_\gamma\right)}{\left(\tfrac{\rho_x}{2} - S_\gamma\right)} . \tag{35} \]

From this expression it is clear that in the degenerate cases when $\frac{\rho_x}{2} = S_1$ (reduction of the
enclosing triangle to a line segment with slope unity) and S = C (reduction to the point
C), right uncertainty is zero.
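
As an illustration of Equation (35), the sketch below computes the right uncertainty twice: once from the closed expression, and once geometrically, by intersecting the line through the gain point $(\frac{\rho_x}{2}, -\frac{\rho_x}{2})$ and S with the line that contains AC. All function names and the numeric values are hypothetical; they are assumptions made only for this example.

```python
def projection(point, slope):
    """w-coordinate where the line with the given slope through `point` meets w = -l."""
    w, l = point
    return (slope * w - l) / (slope + 1.0)


def right_uncertainty_formula(half_gain, S, C, m_gamma):
    """Closed expression: 2 (S_1 - rho/2)(C_gamma - S_gamma) / (rho/2 - S_gamma)."""
    S_1 = projection(S, 1.0)
    S_gamma = projection(S, m_gamma)
    C_gamma = projection(C, m_gamma)
    return 2.0 * (S_1 - half_gain) * (C_gamma - S_gamma) / (half_gain - S_gamma)


def right_uncertainty_geometric(half_gain, S, C, m_gamma):
    """Builds T explicitly and returns 2 (S_1 - T_1)."""
    gw, gl = half_gain, -half_gain          # gain point on the w = -l line
    dw, dl = S[0] - gw, S[1] - gl           # direction from the gain point to S
    # Parameter t at which gain_point + t * direction hits the line through C with slope m_gamma.
    t = (C[1] + m_gamma * (gw - C[0]) - gl) / (dl - m_gamma * dw)
    T = (gw + t * dw, gl + t * dl)
    return 2.0 * (projection(S, 1.0) - projection(T, 1.0))


# Hypothetical configuration: C = (2, 1), AC horizontal, S on the line through C with slope -2.
C, m_gamma, half_gain = (2.0, 1.0), 0.0, 0.25
S = (1.9, 1.2)
formula = right_uncertainty_formula(half_gain, S, C, m_gamma)
geometric = right_uncertainty_geometric(half_gain, S, C, m_gamma)
print(formula, geometric)
```

Both values agree for this configuration, which is what one would expect if Equation (35) is read as $2(S_1 - T_1)$ with T defined as above.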

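As was the case for left uncertainty, Figure 19 suggests an ordering of projections to
the $w = -l$ line that, for non-degenerate initial and reduced enclosing triangles, is

\[ C_\gamma < B_1 < \frac{\rho_x}{2} < T_1 < S_1 < C_1 < C_\beta . \tag{36} \]

As a direct consequence of this ordering, not only is $u_r = 0$ at the two extreme positions
of S, but it is also strictly positive for intermediate points. For an arbitrary, fixed $\rho_x$, then,
the location of S of interest is that of maximum right uncertainty. Since point S lies on a
segment (BC) of a line whose expression is known, we can solve the problem for just one
of its components. To find the maximizer, we take the derivative (hereon denoted with $'$)
of $u_r$ with respect to $w_S$,

\[ u_r = 2\,\frac{\left(S_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - S_\gamma\right)}{\left(\tfrac{\rho_x}{2} - S_\gamma\right)} , \]

\[ u_r' = 2\,\frac{\left[\left(S_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - S_\gamma\right)\right]'\left(\tfrac{\rho_x}{2} - S_\gamma\right) - \left(S_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - S_\gamma\right)\left(\tfrac{\rho_x}{2} - S_\gamma\right)'}{\left(\tfrac{\rho_x}{2} - S_\gamma\right)^2} \]

\[ \phantom{u_r'} = 2\,\frac{S_1'\left(C_\gamma - S_\gamma\right)\left(\tfrac{\rho_x}{2} - S_\gamma\right) + S_\gamma'\left(S_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - \tfrac{\rho_x}{2}\right)}{\left(\tfrac{\rho_x}{2} - S_\gamma\right)^2} . \]
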


At the optimizer, the numerator of this fraction must equal zero. 
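Since S lies on the line with slope $m_\beta$ that joins B and C,

\[ l_S = m_\beta w_S - m_\beta w_C + l_C , \]
\[ l_S = m_\beta w_S - (1 + m_\beta) C_\beta , \]
\[ l_S' = m_\beta . \]

Additionally, from the definition of the projections of S,

\[ S_1 = \frac{w_S - l_S}{2} , \qquad S_1' = \frac{1 - m_\beta}{2} , \]

and

\[ S_\gamma = \frac{m_\gamma w_S - l_S}{m_\gamma + 1} , \qquad S_\gamma' = \frac{m_\gamma - m_\beta}{m_\gamma + 1} . \]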
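If we define

$a = 1 - m_\beta$ ,
$b = 1 + m_\beta$ ,
$c = 1 - m_\gamma$ ,
$d = 1 + m_\gamma$ ,

and

$e = m_\gamma - m_\beta = d - b$ ,

then, the expression for $u_r' = 0$ at the optimizer, $w_S^*$, becomes

\[ a\left(d\tfrac{\rho_x}{2} - e w_S^* - bC_\beta\right)\left(dC_\gamma - e w_S^* - bC_\beta\right) + de\left(a w_S^* + bC_\beta - \rho_x\right)\left(C_\gamma - \tfrac{\rho_x}{2}\right) = 0 . \]

This is a quadratic expression on $w_S^*$, of the form

\[ \mathcal{A}\,(w_S^*)^2 + \mathcal{B}\,w_S^* + \mathcal{C} = 0 , \]

with

\[ \mathcal{A} = ae^2 , \]
\[ \mathcal{B} = 2ae\left(bC_\beta - d\tfrac{\rho_x}{2}\right) , \]
\[ \mathcal{C} = (ad^2 - 2de)\tfrac{\rho_x}{2}C_\gamma - (abd + bde)\tfrac{\rho_x}{2}C_\beta + (bde - abd)C_\beta C_\gamma + ab^2C_\beta^2 + 2de\left(\tfrac{\rho_x}{2}\right)^2 . \]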
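
As an intermediate check that is not spelled out in the text, the discriminant of this quadratic can be simplified using only the definitions of $a$ through $e$. The sketch below takes $\mathcal{A} = ae^2$, $\mathcal{B} = 2ae(bC_\beta - d\frac{\rho_x}{2})$ and the constant term $\mathcal{C}$ as read from the expressions above, and relies on two identities that follow directly from the definitions: $ad - 2e = bc$ and $e - a = -c$.

```latex
\begin{align*}
\mathcal{B}^2 - 4\mathcal{A}\mathcal{C}
  &= 4ae^2\left[a\left(d\tfrac{\rho_x}{2} - bC_\beta\right)^2 - \mathcal{C}\right] \\
  &= 4ae^2\left(\tfrac{\rho_x}{2} - C_\gamma\right)
     \left[d\,(ad - 2e)\,\tfrac{\rho_x}{2} + bd\,(e - a)\,C_\beta\right] \\
  &= 4ae^2\left(\tfrac{\rho_x}{2} - C_\gamma\right)
     \left[bcd\,\tfrac{\rho_x}{2} - bcd\,C_\beta\right] \\
  &= 4abcde^2\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right) .
\end{align*}
```

Under these assumptions the radicand factors completely, which is why only the projections $C_\gamma$ and $C_\beta$ appear under the square root.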

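Applying the quadratic formula and solving,

\[ w_S^* = \frac{2ae\left(d\tfrac{\rho_x}{2} - bC_\beta\right) \pm \sqrt{4abcde^2\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)}}{2ae^2} \]

\[ w_S^* = \frac{d\tfrac{\rho_x}{2} - bC_\beta}{e} + s\,\frac{\sqrt{abcd\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)}}{ae} , \tag{37} \]

where $s$ is the appropriate sign for $w_S^*$ to be a maximizer. Notice that, for an enclosing
triangle, the expression inside the radical is always non-negative. Since $0 < m_\gamma < 1$, $c$ is
positive and $d$ non-negative. Also, as $|m_\beta| > 1$, $a$ and $b$ are of opposite signs and, from the
ordering in Equation (36), $\left(\frac{\rho_x}{2} - C_\gamma\right)$ is positive and $\left(\frac{\rho_x}{2} - C_\beta\right)$ negative, so there will be,
in any case, two negative factors. In order to determine $s$, we take the second derivative of
$u_r$ and evaluate it in the point $w_S^*$, which simplifies to

\[ u_r''(w_S^*) = \frac{2a^2e}{s\sqrt{abcd\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)}} . \]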
For the second order necessary condition for a maximum to hold,

\[ u_r''(w_S^*) < 0 , \]

which requires that

\[ s = -\operatorname{sign}(e) = \operatorname{sign}(m_\beta - m_\gamma) . \]

Observe that for the opposite sign of $s$, the corresponding $w_S^*$ is a minimizer.
Substituting Equation (37) in Equation (35),

\[ u_r^* = \frac{2s\sqrt{abcd\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)} + ad\left(\tfrac{\rho_x}{2} - C_\gamma\right) + bc\left(\tfrac{\rho_x}{2} - C_\beta\right)}{e} . \]
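
The closed form can be compared against a brute-force search over positions of S on the line through C with slope $m_\beta$. This is a minimal sketch under stated assumptions: the function names, the vertex C, the slopes and the gain are hypothetical values picked for the example, and the grid search is only a numerical stand-in for the analytical maximization.

```python
import math


def closed_form(half_gain, C, m_beta, m_gamma):
    """Returns (u_r*, w_S*) from the closed expressions; half_gain stands for rho_x / 2."""
    a, b = 1.0 - m_beta, 1.0 + m_beta
    c, d = 1.0 - m_gamma, 1.0 + m_gamma
    e = m_gamma - m_beta
    s = math.copysign(1.0, m_beta - m_gamma)
    C_beta = (m_beta * C[0] - C[1]) / (m_beta + 1.0)
    C_gamma = (m_gamma * C[0] - C[1]) / (m_gamma + 1.0)
    root = math.sqrt(a * b * c * d * (half_gain - C_gamma) * (half_gain - C_beta))
    u_star = (2.0 * s * root + a * d * (half_gain - C_gamma)
              + b * c * (half_gain - C_beta)) / e
    w_star = (d * half_gain - b * C_beta) / e + s * root / (a * e)
    return u_star, w_star


def right_uncertainty(half_gain, w_S, C, m_beta, m_gamma):
    """Right uncertainty for the point S with abscissa w_S on the line through C."""
    l_S = C[1] + m_beta * (w_S - C[0])
    S_1 = (w_S - l_S) / 2.0
    S_gamma = (m_gamma * w_S - l_S) / (m_gamma + 1.0)
    C_gamma = (m_gamma * C[0] - C[1]) / (m_gamma + 1.0)
    return 2.0 * (S_1 - half_gain) * (C_gamma - S_gamma) / (half_gain - S_gamma)


# Hypothetical configuration (an assumption for illustration only).
C, m_beta, m_gamma, half_gain = (2.0, 1.0), -2.0, 0.0, 0.25
u_star, w_star = closed_form(half_gain, C, m_beta, m_gamma)

# S ranges from the point with S_1 = rho_x / 2 up to the vertex C.
C_beta = (m_beta * C[0] - C[1]) / (m_beta + 1.0)
w_lo = (2.0 * half_gain - (1.0 + m_beta) * C_beta) / (1.0 - m_beta)
n = 20000
grid = [w_lo + k * (C[0] - w_lo) / n for k in range(n + 1)]
grid_max = max(right_uncertainty(half_gain, w, C, m_beta, m_gamma) for w in grid)
print(u_star, grid_max, w_star)
```

For this configuration the grid maximum and the closed form agree to within the grid resolution, and the maximizer falls strictly between the two degenerate positions of S.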


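To show that this is a monotonically decreasing function of $\rho_x$, we factor $(-s)$ and use
the fact that, since it is a sign, $s^2 = (-s)^2 = 1$, to obtain

\[ u_r^* = \frac{ad(-s)\left(\tfrac{\rho_x}{2} - C_\gamma\right) - 2\sqrt{ad(-s)\left(\tfrac{\rho_x}{2} - C_\gamma\right)\,bc(-s)\left(\tfrac{\rho_x}{2} - C_\beta\right)} + bc(-s)\left(\tfrac{\rho_x}{2} - C_\beta\right)}{(-s)e} , \]

\[ u_r^* = \frac{\left(\sqrt{-sad\left(\tfrac{\rho_x}{2} - C_\gamma\right)} - \sqrt{-sbc\left(\tfrac{\rho_x}{2} - C_\beta\right)}\right)^2}{(-s)e} . \tag{38} \]

The expressions inside both radicals are always positive. For a non-degenerate enclosing
triangle, recall that $C_\beta > \frac{\rho_x}{2} > C_\gamma$, $a$ and $b$ are positive and $|m_\gamma| > 1$. If $m_\gamma > 1$, then
$s = -1$, so all factors in the first root are positive, and $c$ is negative, so there are exactly two
negative factors in the second root. Conversely, if $m_\gamma < -1$, then $s = 1$ and $d$ is negative,
so both roots contain exactly two negative factors.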

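To prove decreasing monotonicity, it is sufficient to show that the first derivative of the
function is negative in the range of interest. Taking it,

\[ \frac{\partial u_r^*}{\partial \frac{\rho_x}{2}} = \frac{1}{(-s)e}\left(\sqrt{-sad\left(\tfrac{\rho_x}{2} - C_\gamma\right)} - \sqrt{-sbc\left(\tfrac{\rho_x}{2} - C_\beta\right)}\right)\left(\frac{-sad\sqrt{-sbc\left(\tfrac{\rho_x}{2} - C_\beta\right)} + sbc\sqrt{-sad\left(\tfrac{\rho_x}{2} - C_\gamma\right)}}{\sqrt{abcd\left(\tfrac{\rho_x}{2} - C_\gamma\right)\left(\tfrac{\rho_x}{2} - C_\beta\right)}}\right) . \]

By a similar reasoning to the positivity of the arguments of the roots, the term in the last
parenthesis can be shown to be always positive. Thus, for the derivative to be negative,

\[ \sqrt{-sad\left(\tfrac{\rho_x}{2} - C_\gamma\right)} < \sqrt{-sbc\left(\tfrac{\rho_x}{2} - C_\beta\right)} . \]

Since both terms are positive and $-se = |e|$, this is equivalent to

\[ \frac{adC_\gamma - bcC_\beta}{2e} > \frac{\rho_x}{2} . \]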
Substituting the definition of all the terms and solving, this reduces to

\[ \frac{(m_\gamma - m_\beta)w_C - (m_\gamma - m_\beta)l_C}{2(m_\gamma - m_\beta)} > \frac{\rho_x}{2} , \]

which, by the ordering in Equation (36), holds.

Finally, to see that maximum right uncertainty is zero in the extreme case when $\frac{\rho_x}{2} = C_1$,
notice that in that case the only possible position for S, and therefore $S^*$, is the point C, so
Equation (35) becomes

\[ u_r^* = 2\,\frac{\left(C_1 - \tfrac{\rho_x}{2}\right)\left(C_\gamma - C_\gamma\right)}{\left(\tfrac{\rho_x}{2} - C_\gamma\right)} , \]

which trivially equals zero.
Appendix D.


In this section we include the conic intersection method described in detail by Perwass
(2008). The Lemmas that support the derivation of the method are omitted here, but they
are clearly presented in that book.

This method finds the intersection points of two non-degenerate conics represented by 
the symmetric matrices A and B. 


1. Let $M = B^{-1}A$.

2. Find $\lambda$, any real eigenvalue of $M$. (Since the matrix is $3 \times 3$, at least one real eigenvalue
exists).

3. Find the degenerate conic $C = A - \lambda B$.

4. C represents two lines. Find their intersections with either A or B. 

5. These (up to four points) are the intersections of the two conics. 
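
Steps 1 to 3 can be sketched in a few lines. The sketch below is not the implementation used in the paper: instead of calling an eigenvalue routine on $B^{-1}A$, it looks for a real root of $\det(A - \lambda B)$ by bisection, which yields the same values because $\det(A - \lambda B) = \det(B)\det(B^{-1}A - \lambda I)$. The function names and the two example conics (a unit circle and the hyperbola $xy = 1/4$) are hypothetical, and step 4, splitting the degenerate conic into its two lines, is left out.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def pencil(A, B, lam):
    """Matrix A - lam * B."""
    return [[A[r][k] - lam * B[r][k] for k in range(3)] for r in range(3)]


def real_pencil_root(A, B, iterations=200):
    """Real root of det(A - lam * B), a cubic in lam when B is invertible."""
    f = lambda lam: det3(pencil(A, B, lam))
    lo, hi = -1.5, 1.5
    while f(lo) * f(hi) > 0:          # widen until the cubic changes sign
        lo, hi = 2.0 * lo, 2.0 * hi
    for _ in range(iterations):       # plain bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def quad(Q, p):
    """Quadratic form p^T Q p for a homogeneous point p = (x, y, 1)."""
    return sum(p[r] * Q[r][k] * p[k] for r in range(3) for k in range(3))


# Hypothetical example conics in homogeneous (symmetric matrix) form.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]     # x^2 + y^2 = 1
B = [[0.0, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, -0.25]]    # x y = 1/4

lam = real_pencil_root(A, B)
C = pencil(A, B, lam)                 # degenerate conic of step 3

# A known common point of both example conics: (cos 15 deg, sin 15 deg).
p = (math.cos(math.pi / 12.0), math.sin(math.pi / 12.0), 1.0)
print(lam, det3(C), quad(C, p))
```

In this example the degenerate conic has a vanishing determinant and still passes through the common point of the two original conics, which is the property that steps 4 and 5 rely on.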


In the case of the gain update step in the optimal nudging Algorithm, A and B are
the homogeneous forms of the expressions for maximum left and right uncertainty, from
Equations (21) and (23).

For the optimal nudging update stage, an additional step is required in order to determine
precisely which of the four points corresponds to the intersection of the current left
and right uncertainty segments of the conics. This is easily done by finding which of the
intercepts corresponds to the gain of a point in the $w = -l$ line inside the $A_1B_1$ segment.

In our experience, this verification is sufficient to find the updated gain and there are no
multiple intercepts inside the segment. However, if several competing candidates do appear,
an additional verification step may be required, to determine which is the simultaneous unique
solution of Equations (20) and (22).